
Data for data scientist.

Most areas of Machine Learning (ML) and Artificial Intelligence (AI) require a huge amount of data for training a model. If there is enough data and it meets all the requirements, the accuracy of the model will be high.

In my case, I prefer to work with text in the Azerbaijani language. In my previous experience, my main sources of text were news published on local web pages and the orthographic dictionary of the Azerbaijani language. The dictionary contains more than ten thousand words in their initial form, labeled with the part of speech to which they belong. The data in the dictionary can be considered clean, because all the words in the orthographic dictionary are spelled correctly and match their labels. However, the data set from the news is dirty: it has no labels and contains random text and misspelled words, which is why it decreases the accuracy of an application.

I am planning to do a project related either to Natural Language Processing (NLP) or Natural Language Understanding (NLU). Currently, I am thinking about spell correction. Therefore, I will most probably need to reuse the data from the dictionary, and additionally I will need to create a data set of misspelled words, as sketched below.
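
A minimal sketch of how such a data set could be generated, assuming the dictionary is available as a plain UTF-8 word list with one word per line (the file names, letter set, and choice of edit operations here are illustrative assumptions, not a fixed design): each clean dictionary word is paired with a noisy variant produced by one random character-level edit.

    # Sketch: build (misspelled, correct) word pairs from a clean dictionary.
    # File names, letter set, and edit operations are illustrative assumptions.
    import random

    # Simplified set of lowercase Azerbaijani letters (assumption)
    ALPHABET = "abcçdeəfgğhxıijkqlmnoöprsştuüvyz"

    def corrupt(word: str) -> str:
        """Apply one random character-level edit: delete, replace, insert, or swap."""
        if len(word) < 2:
            return word
        op = random.choice(["delete", "replace", "insert", "swap"])
        if op == "swap":
            i = random.randrange(len(word) - 1)
            return word[:i] + word[i + 1] + word[i] + word[i + 2:]
        i = random.randrange(len(word))
        if op == "delete":
            return word[:i] + word[i + 1:]
        if op == "replace":
            return word[:i] + random.choice(ALPHABET) + word[i + 1:]
        return word[:i] + random.choice(ALPHABET) + word[i:]  # insert

    with open("dictionary.txt", encoding="utf-8") as src, \
         open("misspelled_pairs.tsv", "w", encoding="utf-8") as out:
        for line in src:
            word = line.strip()
            if word:
                out.write(corrupt(word) + "\t" + word + "\n")

The resulting tab-separated pairs (misspelled word, correct word) could then serve as training examples for a spell-correction model, since the random edits roughly mirror the kinds of mistakes found in the dirty news data.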

1 thought on “Data for data scientist.”

  1. pless

    I think that you've correctly identified the need to find exactly what data you'd like to work with. Maybe it is useful to really dig into this. If you use news published in recent articles, can you easily get access to those articles?

    The other thing that isn't clear to me is what exactly the inputs and outputs for your model are. Are you trying to take "potentially misspelled and awkwardly written Azerbaijani text and make it correctly spelled", or is it something else?

    Is spell correction especially hard in Azerbaijani? Does it exist in other places (like Gmail or MS Word)?

