Description of dataset

The research process has many aspects, and finding and using the right dataset is an important one, especially for projects related to Natural Language Processing (NLP). In most NLP projects, it is crucial to use real data to get an effective implementation and better outcomes.

Developing dialog management systems is one such project. To build a conversational agent that is capable of handling complex dialog turns and continuing a human-like conversation, the dataset used to train the models should also contain conversational patterns. For example, text messages from messengers such as WhatsApp or Facebook, or transcripts from online customer service, are good candidates. Additionally, the dataset should reflect the characteristics of the project domain. For example, a chatbot for a tourism company must be able to handle common customer requests, such as questions about tourist destinations.

In my work, I am planning to use real human-to-human messages in the Azerbaijani language. It is a noisy dataset that contains misspellings, the usual noise of internet data, and incomplete sentences. Additionally, the agglutinative nature of Azerbaijani, in which the same word can appear in several morphological forms, should also be considered. On the other hand, this dataset has a structure that suits chatbots: it is in the form of questions and answers.
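
To make the kind of cleanup such noisy data may need more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not the pipeline used in this project: the message format (a list of `(speaker, text)` tuples), the `"customer"` and `"agent"` labels, and the function names `az_lower`, `normalize`, and `to_pairs` are all hypothetical. It lowercases text with Azerbaijani casing rules, removes URLs, shortens exaggerated letter repetitions, and pairs each customer message with the reply that directly follows it.

```python
import re

URL_RE = re.compile(r"https?://\S+")
# Three or more repetitions of the same letter (digits and punctuation are left alone).
REPEAT_RE = re.compile(r"([^\W\d_])\1{2,}")
SPACE_RE = re.compile(r"\s+")


def az_lower(text):
    # Azerbaijani casing: "I" lowercases to dotless "ı" and "İ" to "i".
    # Assumption: the text is Azerbaijani; English words with "I" would be altered.
    return text.replace("İ", "i").replace("I", "ı").lower()


def normalize(text):
    text = URL_RE.sub(" ", text)         # drop links
    text = az_lower(text)
    text = REPEAT_RE.sub(r"\1\1", text)  # "saaaalam" -> "saalam"
    return SPACE_RE.sub(" ", text).strip()


def to_pairs(messages):
    """Pair each customer message with the agent reply that immediately follows it.

    `messages` is a hypothetical list of (speaker, text) tuples in chat order.
    """
    pairs = []
    for (s1, t1), (s2, t2) in zip(messages, messages[1:]):
        if s1 == "customer" and s2 == "agent":
            q, a = normalize(t1), normalize(t2)
            if q and a:  # skip pairs that became empty after cleaning
                pairs.append((q, a))
    return pairs
```

Note that squeezing repeated letters only reduces the noise; it does not correct misspellings, and it does nothing about the different morphological forms of a word, which would need a separate step such as stemming or subword tokenization.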

1 thought on “Description of dataset”

Robert Pless July 21, 2021 at 8:20 pm

Thinking about conversations in Azerbaijani is certainly a domain for a potential project. However, it is *very* important to drill down and get more specific details about what datasets you can access. For example, I don't think you will be able to get access to a large collection of Facebook or WhatsApp data. If you have connections to some company, you *may* be able to get access to data from that company, but companies are often very reluctant to share that.

Wikipedia (https://az.wikipedia.org/wiki/Ana_səhifə) and Reddit (but I can't find an easy way to find Azerbaijani posts there) are both quite easy to scrape, but they may have text that has different properties than a conversation.

I also think that "building a chatbot" is quite hard and a big project; if there is existing chatbot code that you can start from, and you can show how to improve it in some setting, that would be completely ok and would let you focus on the parts of the problem that haven't already been done.


