First of all, what is sentiment analysis? Sentiment analysis is a natural language processing (NLP) technique used to identify whether a piece of text is positive, negative, or neutral. Organizations commonly use it to categorize feedback about their brand and to design appropriate campaigns based on the results.
In our master's thesis project, two types of data will be used for sentiment analysis. The first is a set of Azerbaijani news articles collected from various local websites. The main goal here is to identify the sentiment of each news article. The data is split into two sets: training data and test data. The training data will be labeled as positive, negative, or neutral based on the polarity of each article. Afterwards, the test data will be used to evaluate the accuracy of the program, as sketched below.
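Here is a minimal sketch of that split, assuming the labeled articles are kept as (text, label) pairs; the sample articles, their labels, and the 80/20 ratio are illustrative assumptions, not our actual data:

```python
# Minimal sketch: splitting labeled articles into training and test sets.
# The articles, labels, and 80/20 ratio below are illustrative assumptions.
from sklearn.model_selection import train_test_split

articles = [
    ("Komanda mühüm qələbə qazandı", "positive"),      # hypothetical example
    ("İqtisadi vəziyyət pisləşir", "negative"),        # hypothetical example
    ("Hökumət yeni qanun qəbul etdi", "neutral"),      # hypothetical example
    ("Festival böyük uğurla keçdi", "positive"),       # hypothetical example
    ("Qəza nəticəsində yollar bağlandı", "negative"),  # hypothetical example
]

texts = [text for text, _ in articles]
labels = [label for _, label in articles]

# Hold out 20% of the labeled articles to measure accuracy later.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.2, random_state=42
)
```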
The second type of data we will use in our project is a sentiment lexicon (dictionary). We will manually translate words and assign a sentiment value to each of them. Once the dictionary contains a sufficient number of words, the program will be able to evaluate an entire given text, as in the sketch below. Naturally, the number of words in the dictionary affects the accuracy of the result: the more words it contains, the higher the accuracy that can be achieved. The motivation here is that many such lexicons exist for other languages, but for Azerbaijani there is a lack of verified sources. Therefore, building a reliable Azerbaijani lexicon and offering the program to local organizations could have a significant influence on the progress of local brands.
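As a minimal sketch of how the lexicon-based evaluation could work, assuming the dictionary maps each Azerbaijani word to a polarity score in [-1, 1] (the entries and scores below are illustrative placeholders, not real dictionary entries):

```python
# Minimal sketch: lexicon-based scoring of a whole text. Each known word
# contributes its polarity score; the total is mapped to a label.
# The entries and scores below are illustrative placeholders.
lexicon = {
    "gözəl": 0.8,   # "beautiful" -> positive (hypothetical score)
    "yaxşı": 0.6,   # "good"      -> positive (hypothetical score)
    "pis": -0.7,    # "bad"       -> negative (hypothetical score)
}

def score_text(text: str) -> str:
    """Sum the polarity of known words and map the total to a label."""
    total = sum(lexicon.get(word, 0.0) for word in text.lower().split())
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"

print(score_text("Bu gözəl və yaxşı bir gün idi"))  # -> "positive"
```

Words missing from the lexicon simply contribute nothing, which is why coverage (the number of translated words) directly drives accuracy, as noted above.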
NLP forms the basis of modern software that recognizes and interprets human language. Social networks are a crucial part of this evolution, as we unintentionally produce terabytes of data every day simply by using modern social platforms. Natural language processing is a developing branch of artificial intelligence that aims to automate interactions between computers and humans using the structure of natural language.
This is a great start! Let's dig a little bit into the data. Can you get access to the local news articles? How many? (It's OK to guess at this point.) What would it take to do that? For example, would you need to write a web crawler?
How could you get labels for sentiments? How many do you think you would need? Are there articles for which you could get a sentiment label automatically? (For example, you might be able to automatically look up who won a soccer match; articles from the hometown of the winning team are likely to have positive sentiment, and articles from the hometown of the losing team are likely to be negative.)
Thank you, Professor, for the feedback. We have obtained approximately 30,000 news articles from one of the news companies. Currently, about 15,000 of them have been manually labeled by us to build the training set.
In the examples shown below, the structure of the news article differs depending on whether the country won or lost.
A post from the losing country (a real-world example):
"England loses a football match; a nation mourns - THE NEWSPAPERS’ favoured word was “heartbreak”. After two hours of football against Italy on July 11th, the country’s football team yet again failed to win a tournament, having at last got to a final. To make matters worse, it lost—again—in a penalty shoot-out, shredding the nerves and dashing the hopes of its fans. The players who missed penalties suffered sickening abuse online."
A post from the winning country (a real-world example):
"Italian President Sergio Mattarella to the Azzurri: "You didn’t just try to win, you won by playing magnificent football. You displayed harmony as a team and in how you played, and Roberto Mancini deserves our thanks. He’s shown faith ever since taking over the role, revolutionized the team’s build-up play and been accurate in his preparations for every match.” "
In other words, when a country's press publishes a news article about its own team, the sentiment (emotion) of the article differs depending on whether the team wins or loses. The sentiment value is then calculated from the weights of the words in the bag-of-words representation, as in the sketch below.
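Here is a minimal sketch of that weighted bag-of-words calculation, assuming each word carries a hand-assigned lexicon weight; the weights below are illustrative assumptions keyed to the two articles quoted above:

```python
# Minimal sketch: weighted bag-of-words sentiment scoring. Each word's
# count is multiplied by its lexicon weight and the products are summed.
# The weights below are illustrative assumptions, not real lexicon entries.
import re
from collections import Counter

weights = {"won": 0.9, "magnificent": 0.8, "heartbreak": -0.9, "mourns": -0.8}

def sentiment_score(text: str) -> float:
    """Return the sum of (word count x lexicon weight) over the text."""
    words = re.findall(r"\w+", text.lower())
    counts = Counter(words)
    return sum(weights.get(word, 0.0) * count for word, count in counts.items())

print(sentiment_score("England loses; a nation mourns after the heartbreak."))
# -> -1.7 (negative), matching the losing country's article
print(sentiment_score("You won by playing magnificent football."))
# -> 1.7 (positive), matching the winning country's article
```

Under this scheme the same match produces opposite scores in the two countries' press, which is exactly the asymmetry the quoted examples illustrate.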