In machine learning, data plays a crucial role. No matter how good an algorithm is, if the data it is given is not good enough, the result will be a poor model. On the other hand, with the right data even an "average" algorithm can produce a highly accurate model. There are a few points to consider when validating the data.
Data has to be clean, meaning: 1. It must be complete: there should not be any empty fields. 2. Its accuracy should be audited, and any inconsistencies should be fixed. 3. Similar data should share the same format. For example, all event times should be stored in the same 'timestamp' format.
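The completeness and format checks above can be sketched in a few lines of Python. This is only a minimal illustration: the records, the field names, the list of accepted input formats, and the choice to treat every time as UTC are all assumptions made for the example, not part of any particular dataset.

```python
from datetime import datetime, timezone

# Hypothetical raw records; the field names are illustrative only.
records = [
    {"event": "login", "time": "2023-01-05 10:30:00"},
    {"event": "purchase", "time": "05/01/2023 11:00"},
    {"event": "", "time": "2023-01-05T12:15:00"},
]

# Assumed set of time formats that appear in the raw data.
KNOWN_FORMATS = ["%Y-%m-%d %H:%M:%S", "%d/%m/%Y %H:%M", "%Y-%m-%dT%H:%M:%S"]


def missing_fields(record):
    """Return the names of fields that are None or empty."""
    return [key for key, value in record.items() if value is None or value == ""]


def to_timestamp(text):
    """Convert a time string in any known format to a Unix timestamp.

    Times are assumed to be in UTC; adjust if the data says otherwise.
    """
    for fmt in KNOWN_FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue  # try the next known format
        return parsed.replace(tzinfo=timezone.utc).timestamp()
    raise ValueError(f"unrecognized time format: {text!r}")


# Completeness check: collect records with at least one empty field.
incomplete = [r for r in records if missing_fields(r)]

# Format check: bring every time value into one common representation.
timestamps = [to_timestamp(r["time"]) for r in records]
```

Raising an error on an unknown format, rather than silently skipping the record, is a deliberate choice here: an unexpected format is itself an inconsistency worth auditing.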
Data Availability: Teams in a project usually run several experiments, so they need data that is available and easy to access. Therefore, data should be kept in a place such as a data lake or data warehouse where every member of the team can reach it.
Data Relevancy: For a machine learning project, the data has to be relevant to the problem under consideration. There should be enough data to describe the domain in which the problem exists. For example, in an anomaly detection project for card transactions, information about only the card issuer and the location of the transaction might not be enough. Additional data such as the transaction amount, date, etc. would likely result in much better performance.
Data Bias: A machine learning algorithm learns only from the data it is given. Therefore, if there is a bias in the data, it will lead to unexpected results. There is a famous anecdote in which the task was to distinguish US tanks from Russian tanks. In the lab the project seemed promising, with high accuracy, but in the field it failed. The reason was that the training data was biased: the US tank images were all taken on a sunny day and the Russian tank images were all taken on a cloudy day. Hence, the model learned to distinguish a sunny day from a cloudy day, not friendly tanks from enemy tanks.
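One simple way to catch this kind of bias before training is to check whether some incidental attribute lines up perfectly with the label. The sketch below assumes that each image comes with a recorded weather attribute, which the anecdote does not state; the sample data and the function name attribute_predicts_label are hypothetical.

```python
from collections import Counter

# Hypothetical image metadata as (label, weather) pairs.
samples = [("us", "sunny")] * 4 + [("russian", "cloudy")] * 4

# Cross-tabulate label against the incidental attribute.
counts = Counter(samples)


def attribute_predicts_label(pairs):
    """Return True if the attribute alone determines the label.

    That is the case when there is more than one label and every
    attribute value occurs with exactly one label.
    """
    labels_by_attr = {}
    for label, attr in pairs:
        labels_by_attr.setdefault(attr, set()).add(label)
    distinct_labels = {label for label, _ in pairs}
    return len(distinct_labels) > 1 and all(
        len(labels) == 1 for labels in labels_by_attr.values()
    )


if attribute_predicts_label(samples):
    print("Warning: weather alone separates the classes:", dict(counts))
```

A check like this only covers attributes that were actually recorded, and only a perfect overlap; a strong but imperfect correlation would need a softer measure.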
Data Time Window: When a model is trained on data captured over the wrong time period, it won't be able to produce good results, because the data does not contain enough variation. For example, a model trained on stock data that covers only the last month will not capture the dynamics of the stock market. In this case, a few years of data would include many different recurring events, which can be useful. Too long a window can be a problem as well: it can lead the model to learn outdated information and hence perform poorly.
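Choosing a window often comes down to a simple date filter applied before training. The following is a minimal sketch, assuming the data is a list of (date, closing price) rows; the rows, the function name within_window, and the three-year window are illustrative choices, not recommendations.

```python
from datetime import date, timedelta

# Hypothetical daily rows as (trading date, closing price).
rows = [
    (date(2015, 1, 2), 101.5),   # probably outdated
    (date(2021, 6, 1), 140.2),
    (date(2023, 5, 31), 155.0),
]


def within_window(data, end, years):
    """Keep rows dated within `years` years (counted as 365 days each) up to `end`."""
    start = end - timedelta(days=365 * years)
    return [row for row in data if start <= row[0] <= end]


# Train on roughly the last three years rather than one month or the full history.
training_rows = within_window(rows, end=date(2023, 6, 1), years=3)
```

The right window length depends on the problem, so in practice it is worth treating it as a parameter and comparing model performance across a few values.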