Data

In machine learning, data plays a crucial role. No matter how good an algorithm is, if the data provided is not good enough, the result will be a poor model. On the other hand, with the right data, even an "average" algorithm can produce a highly accurate model. There are a few points to consider when validating the data.

Data has to be clean, meaning: 1. It must be complete: there should be no empty fields. 2. Its accuracy should be audited, and any inconsistencies fixed. 3. Similar data should share the same format. For example, all times and events must be in 'timestamp' format.
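
The completeness and format checks above can be sketched in a few lines. This is only an illustration: the records, the field names (`card_issuer`, `amount`, `time`) and the choice of ISO 8601 as the common timestamp format are assumptions made for the example, not something prescribed by a particular project.

```python
from datetime import datetime

# Hypothetical records; field names and values are made up for illustration.
records = [
    {"card_issuer": "BankA", "amount": "25.00", "time": "2021-03-01T10:15:00"},
    {"card_issuer": "", "amount": "13.50", "time": "2021-03-01T11:00:00"},
    {"card_issuer": "BankB", "amount": "7.25", "time": "03/01/2021 12:30"},
]

def find_problems(rows, timestamp_field="time"):
    """Return (row index, description) pairs for empty fields
    and for timestamps that are not in ISO 8601 format."""
    problems = []
    for i, row in enumerate(rows):
        # Completeness: no field may be missing or blank.
        for field, value in row.items():
            if value is None or str(value).strip() == "":
                problems.append((i, f"empty field: {field}"))
        # Consistent format: every timestamp must parse as ISO 8601.
        value = row.get(timestamp_field)
        if value:
            try:
                datetime.fromisoformat(value)
            except ValueError:
                problems.append((i, f"inconsistent timestamp: {value}"))
    return problems

problems = find_problems(records)
for index, description in problems:
    print(index, description)
```

A check like this only finds problems; deciding whether to drop, repair or re-collect the affected rows still depends on the project.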

Data availability: Teams usually run several experiments during a project, so they need data that is available and easy to access. The data should therefore be kept in a place such as a data lake or data warehouse where every member of the team can reach it.

Data Relevancy: For a machine learning project, the data has to be relevant to the problem under consideration. There should be enough data to adequately describe the domain in which the problem exists. For example, in an anomaly detection project on card transactions, information about only the card issuer and the location of the transaction might not be enough. Additional data such as the transaction amount, date, etc. would likely result in much better performance.

Data Bias: A machine learning algorithm learns only from the data it is given. Therefore, if there is a bias in the data, it will lead to unexpected results. There is a famous anecdote in which the task was to distinguish US tanks from Russian tanks. In the lab the project seemed promising, with high accuracy, but in the field it failed. The reason was that the model was trained on biased data: the US tank images were all taken on a sunny day and the Russian tank images on a cloudy day. Hence, the model learned to distinguish sunny days from cloudy days, not friendly tanks from enemy tanks.
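
One simple way to look for this kind of bias before training is to count how often each label occurs together with a side attribute such as the weather. The sketch below assumes that such metadata exists for every image; the samples and the helper names (`label_attribute_counts`, `is_confounded`) are hypothetical.

```python
from collections import Counter

# Hypothetical image metadata: (label, weather) pairs.
samples = [
    ("us_tank", "sunny"), ("us_tank", "sunny"), ("us_tank", "sunny"),
    ("russian_tank", "cloudy"), ("russian_tank", "cloudy"), ("russian_tank", "cloudy"),
]

def label_attribute_counts(pairs):
    """Count how often each (label, attribute) combination occurs."""
    return Counter(pairs)

def is_confounded(pairs):
    """True if there are several labels and every attribute value appears
    with exactly one of them, so the attribute alone predicts the label."""
    labels = {label for label, _ in pairs}
    labels_per_attribute = {}
    for label, attribute in pairs:
        labels_per_attribute.setdefault(attribute, set()).add(label)
    return len(labels) > 1 and all(
        len(seen) == 1 for seen in labels_per_attribute.values()
    )

counts = label_attribute_counts(samples)
confounded = is_confounded(samples)
print(counts)
print("label fully determined by weather:", confounded)
```

This catches only the extreme case where the attribute determines the label completely. A strongly skewed but not perfect split would need a closer look at the counts themselves.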

Data Time Window: When a model is trained on data captured over the wrong time period, it won't produce good results because the data lacks variance. For example, if the data is stock data covering only the last month, the model will not capture the dynamics of the stock market. In this case, a few years of data would include many different recurring events, which can be useful. Too long a window can be a problem too: it can lead the model to learn outdated information and hence perform poorly.
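
The effect of the window length can be made concrete with a small filter over dated records. The prices below are invented, and the window lengths (30 days, three years) are illustrative assumptions; the right window has to be chosen per problem.

```python
from datetime import date, timedelta

def select_window(rows, end, days):
    """Keep rows whose date lies within the `days` days up to `end`, inclusive."""
    start = end - timedelta(days=days)
    return [row for row in rows if start <= row[0] <= end]

# Hypothetical (date, closing price) pairs.
prices = [
    (date(2010, 1, 4), 10.0),
    (date(2019, 6, 3), 42.0),
    (date(2021, 11, 1), 55.0),
    (date(2021, 12, 1), 57.0),
]

end = date(2021, 12, 31)
last_month = select_window(prices, end, days=30)
three_years = select_window(prices, end, days=3 * 365)

print(len(last_month), "rows in the last 30 days")
print(len(three_years), "rows in the last three years")
```

Here the 30-day window keeps too little history, while the three-year window keeps the recent records and drops the 2010 one, which may describe a market that no longer exists.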

4 thoughts on “Data”

  1. skaisler

    Habil:

    While data should have the properties that you mention in your blog, rarely is the data that you get pristine.

    So, as you think about developing a model, you should think about how to detect background noise and how to represent it in your model.
    Interview data may be collected with the microphone close to the speaker's mouth, so it may be amplified over the background noise. Different kinds of filters can help to remove some of that noise, but you will have to experiment with a few of them to see:
    1) how the background noise is characterized, and
    2) how much can be removed from original recording.

    This will require both frequency and spectral analysis, I think.

    1. hgadirli

      Thank you, Professor Kaisler, for your comment. I will consider noise in my project. In images, as far as I know, we can remove some of the noise by applying a blur filter.

  2. pless

    Another comment: This is a great post about why data is important. But I'm worried that you don't actually show any specific data. I know that you are still in the process of figuring out what you might work on as a project. But for the purpose of this post --- pick a possible project and show exactly what data you think you could use to attack that problem!

    1. hgadirli

      Thank you, Professor Pless, for your comment. I will add specific data as soon as possible. We have already decided on the topic of the project, and I will add data related to it.
