Another great quarter of learning. Learned quite a bit from both Dr. Mehdi Hashemipour and Dr. John M. Fossaceca.
Jupyter Notebook on GitHub - ipynb file on GitHub
My Jupyter Lab Notion Notes for local setup
Overview
For my program research topic, I was thinking of leveraging GPT to investigate fake or real content in social media. I wanted to start with natural language processing (NLP), knowing how relevant it would be with ChatGPT. I found some text data with a feed of real and fake news. At some point it would be neat to see whether the same approach works with images and videos, but for now let us start small with text and language.
The problem I see is that generative AI technologies have become more and more pervasive since the popularization of OpenAI’s ChatGPT. False information is being posted all over the internet and social media. With ChatGPT, it is likely that generated content will either be derived from real material or be entirely fake while looking real. From a cybersecurity perspective, it makes sense to have tools that can inspect content and tell whether it is real or fake, and possibly, later on, even find fake content and have a bot post real information so people do not get tricked.
The question I have is: can NLP distinguish real from fake news in social media? And if it can, is there reason to believe that ChatGPT could likewise detect real versus fake news in social media?
Data
I searched various cybersecurity-related datasets from the links provided. I knew there had to be some data related to social media and found a dataset of real and fake news on Twitter! Given that the tweets are short but rich with words, and that the files are already separated into real and fake so I could label them appropriately, it made a lot of sense to give it a try.
There were four CSV files in this location (link below): politifact_fake.csv, politifact_real.csv, gossipcop_fake.csv, and gossipcop_real.csv. It also looked like there were tools to gather new data if desired, but the samples seemed good enough for what I needed. I merged the four files into one Pandas dataframe with 23,196 rows and 5 columns. The columns were id, news_url, title, tweet_ids, and label (Figure 1). All the data was textual, and the first thing I did with the labels was convert fake and real into numeric 0 and 1. When sorting the data, I found an imbalance where most of the records were real.
https://github.com/KaiDMML/FakeNewsNet/tree/master/dataset
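Roughly, the merge, label conversion, and null cleanup could look like the sketch below (file and column names come from the FakeNewsNet repository; the exact notebook code may differ):

```python
import pandas as pd

# The four FakeNewsNet CSV files and the label implied by each file name
files = {
    "politifact_fake.csv": "fake",
    "politifact_real.csv": "real",
    "gossipcop_fake.csv": "fake",
    "gossipcop_real.csv": "real",
}

frames = []
for name, label in files.items():
    frame = pd.read_csv(name)
    frame["label"] = label  # tag each file's rows with its real/fake label
    frames.append(frame)

# Merge into one dataframe: id, news_url, title, tweet_ids, label
df = pd.concat(frames, ignore_index=True)

# Convert the textual labels into numeric form (fake=0, real=1)
df["label"] = df["label"].map({"fake": 0, "real": 1})

# Drop rows with null news_url or tweet_ids values
df = df.dropna(subset=["news_url", "tweet_ids"])
```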
There were 17,441 records labeled as real news and 5,755 labeled as fake. There were also rows with null values in the news_url and tweet_ids columns (Figure 2). After cleaning the data, I was left with 16,120 real news records and 5,287 fake news records.
The imbalance was staggering: roughly 75.3% real to 24.7% fake records (Figure 3). I decided to over-sample the fake data to get a 50/50 mix of records (Figure 4). I had tried under-sampling initially and found it really dragged down the accuracy later on.
Before moving into modeling, I decided to perform an 80/20 split of the dataset for training and testing respectively.
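A minimal sketch of the over-sampling and 80/20 split, assuming simple random over-sampling with replacement via pandas and scikit-learn's train_test_split (the notebook may do this differently):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Over-sample the minority (fake) class with replacement to reach a 50/50 mix
real = df[df["label"] == 1]
fake = df[df["label"] == 0]
fake_oversampled = fake.sample(n=len(real), replace=True, random_state=42)
balanced = pd.concat([real, fake_oversampled], ignore_index=True)

# 80/20 split into training and testing sets
train_df, test_df = train_test_split(balanced, test_size=0.2, random_state=42)
```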
Model
Given I was doing NLP, I knew the data would have to be pre-processed and tokenized. I went online and found information about the bag-of-words technique. The lines of text would have to be stripped of useless characters, then tokenized, and finally lemmatized. Lemmatization reduces words to more basic forms, for example stripping the “ing” from “removing” to get “remove.” These word bags would then be associated with the labels previously provided.
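A rough sketch of that preprocessing pipeline, assuming NLTK's WordNetLemmatizer and scikit-learn's CountVectorizer for the bag of words (the notebook may use different libraries):

```python
import re

import nltk
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

nltk.download("wordnet", quiet=True)  # lookup data for the lemmatizer
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    # Strip everything except letters, lowercase, tokenize, then lemmatize
    text = re.sub(r"[^a-zA-Z]", " ", str(text)).lower()
    tokens = text.split()
    return " ".join(lemmatizer.lemmatize(tok, pos="v") for tok in tokens)

# Build the bag-of-words matrix from the cleaned titles
# (train_df comes from the 80/20 split sketched earlier)
cleaned = train_df["title"].apply(preprocess)
vectorizer = CountVectorizer(stop_words="english", max_features=5000)
X_train = vectorizer.fit_transform(cleaned)
y_train = train_df["label"]
```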
After this, the news data had to be split again, into X (features) and y (labels), and into training versus testing datasets. Afterwards I used a confusion matrix to get a sense of how well the split data looked between the true and false predictions (Figure 5). This looked acceptable, so it was good to move on.
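As a rough illustration, the test titles can be vectorized with the same bag-of-words vocabulary and sanity-checked with a confusion matrix; the quick logistic regression baseline here is only an assumption for the sketch, not necessarily what produced Figure 5:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Vectorize the held-out test titles with the vectorizer fit on the training set
X_test = vectorizer.transform(test_df["title"].apply(preprocess))
y_test = test_df["label"]

# Quick baseline fit just to sanity-check the split with a confusion matrix
baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(confusion_matrix(y_test, baseline.predict(X_test)))
```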
Results
Given past experiences with AutoML and having just learned about PyCaret, I was curious to try it out and see if it could shortcut the search for the best classifiers for the news prediction task. Looking at Figure 6, the Gradient Boosting Classifier (gbc) and Random Forest Classifier (rf) looked very promising. Looking at the Recall column, I noticed gbc and rf had a weakness where some form of linear or logistic regression could help with prediction. The AutoML accuracy levels were on the lower side, with the highest at 0.7259 and an AUC of 0.7315. This did not seem bad given that the best F1 score was also fairly high at around 0.7844. The F1 score helps evaluate the performance of these individual models by accounting for both precision and recall.
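Something along these lines reproduces the PyCaret comparison, assuming the bag-of-words features are packed into a dataframe alongside the label column (that packing is my assumption; the notebook may wire this up differently):

```python
import pandas as pd
from pycaret.classification import setup, compare_models

# Pack the bag-of-words features and labels into one dataframe for PyCaret
features = pd.DataFrame(X_train.toarray(), columns=vectorizer.get_feature_names_out())
features["label"] = y_train.reset_index(drop=True)

# Let PyCaret try its library of classifiers and rank them (gbc and rf ranked well here)
clf_setup = setup(data=features, target="label", session_id=42)
best_model = compare_models()
```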
Optimizing the models according to AUC shows that the different machine learning models average around 0.7316 on this binary classification task (Figure 7). This is not very high, but it is fair, and I had hopes that an ensemble could do better.
When testing the model against the hold-out test dataset, the accuracy and AUC were close to what we saw during training (Figure 8).
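The AUC-driven tuning and the hold-out check could be done roughly like this with PyCaret's standard calls (a sketch, not necessarily the exact notebook code):

```python
from pycaret.classification import create_model, tune_model, predict_model

# Build a gradient boosting model, tune it to maximize AUC,
# then score it against the hold-out split PyCaret set aside
gbc = create_model("gbc")
tuned_gbc = tune_model(gbc, optimize="AUC")
holdout_results = predict_model(tuned_gbc)
```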
I then tried gradient boosting on its own using the scikit-learn classifier and graphed the ROC curve. The curve looked a bit off: accuracy was around that lower-70% range as expected, but interestingly the ROC-AUC looked better, nearing 80% (Figure 9).
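A minimal scikit-learn sketch of the standalone gradient boosting run and its ROC curve (hyperparameters are assumptions):

```python
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, roc_auc_score, RocCurveDisplay

# Standalone gradient boosting on the bag-of-words features
gb = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)
probs = gb.predict_proba(X_test)[:, 1]

print("Accuracy:", accuracy_score(y_test, gb.predict(X_test)))
print("ROC-AUC :", roc_auc_score(y_test, probs))

# Plot the ROC curve (as in Figure 9)
RocCurveDisplay.from_estimator(gb, X_test, y_test)
plt.show()
```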
So I looked into logistic regression and random forest, knowing these were curves of interest that might help predictions alongside gradient boosting (Figure 10). Surprisingly, logistic regression reached an accuracy of 0.86 with an ROC-AUC score of 0.93, and random forest reached an accuracy of 0.91 with an ROC-AUC score of 0.97! It almost seems that with just these two, a prediction model could be put together without the AutoML favorite, gradient boosting.
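The two individual models can be fit and scored roughly like this (the hyperparameters here are assumptions for the sketch):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

# Logistic regression and random forest individually (Figure 10)
for name, model in [
    ("logistic regression", LogisticRegression(max_iter=1000)),
    ("random forest", RandomForestClassifier(n_estimators=200, random_state=42)),
]:
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: accuracy={acc:.2f}, ROC-AUC={auc:.2f}")
```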
It made sense to give an ensemble of these two models a try, and this led to an accuracy of 0.91 with an ROC-AUC of 0.97 (Figure 11). What would happen if we added all three classifiers?
It turns out not much; it actually hurt the predictive performance slightly, with accuracy moving from 0.908 to 0.903 and the ROC-AUC score moving from 0.969 to 0.960 (Figure 12).
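A soft-voting ensemble along these lines captures the two-model versus three-model comparison (again a sketch; the notebook's actual ensemble setup may differ):

```python
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

lr = LogisticRegression(max_iter=1000)
rf = RandomForestClassifier(n_estimators=200, random_state=42)
gb = GradientBoostingClassifier(random_state=42)

# Two-model ensemble (Figure 11) versus all three classifiers (Figure 12)
for name, estimators in [
    ("lr + rf", [("lr", lr), ("rf", rf)]),
    ("lr + rf + gbc", [("lr", lr), ("rf", rf), ("gbc", gb)]),
]:
    ensemble = VotingClassifier(estimators=estimators, voting="soft")
    ensemble.fit(X_train, y_train)
    acc = accuracy_score(y_test, ensemble.predict(X_test))
    auc = roc_auc_score(y_test, ensemble.predict_proba(X_test)[:, 1])
    print(f"{name}: accuracy={acc:.3f}, ROC-AUC={auc:.3f}")
```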
It was interesting how PyCaret’s AutoML seemed to reach lower accuracy than models built directly with various scikit-learn classifiers. When working with binary classification, as in this scenario, random forest and logistic regression seem to come up quite a bit as winners, as they did here. When creating an ensemble, what can be seen is a smoothing of the curve (Figure 13).
Yet it was clear that random forest alone was quite accurate and had the greater ROC-AUC score. If anything, adding gradient boosting or logistic regression made it slightly worse.
Conclusions
From the analysis, I learned that tokenizing and bag-of-words were great techniques for developing an NLP predictor that can detect real versus fake news from Twitter. Given that this kind of rudimentary predictor was fairly effective, I have a strong sense that ChatGPT could not only generate content but also detect malformed or fake content. The curiosity is there, and this is something I would suggest trying in future studies.
Links
- Original GitHub project for datasets - https://github.com/KaiDMML/FakeNewsNet/tree/master/dataset
- My GitHub project with Jupyter Notebook showing how to NLP this thing - https://github.com/nispoe/FakeNews-NLP-Prediction
- My Notion notebook for local Jupyter Lab setup - https://eopsin.notion.site/Jupyter-Lab-f5a66095fcf14328b3b8c56fc545487d