
Oh my gosh... Artificial Intelligence gone wild!

Another great quarter of learning. Learned quite a bit from both Dr. Mehdi Hashemipour and Dr. John M. Fossaceca.

Jupyter Notebook as HTML

Jupyter Notebook on GitHub - ipynb file on GitHub

My Jupyter Lab Notion Notes for local setup

Overview

For my program research topic, I was thinking of leveraging GPT to investigate whether content in social media is fake or real. I wanted to start with natural language processing (NLP), knowing how relevant it would be with ChatGPT. I found some text data with a feed of real and fake news. At some point it would be neat to see if I can do this with images and videos. For now, let us start small with text and language.

The problem I see is that generative AI technologies are becoming more and more pervasive after the popularization of OpenAI’s ChatGPT. Many have been posting false information all over the internet and social media. I believe there is a likelihood that ChatGPT will generate content that is derivative of real content, or simply fake but looks real. From a cybersecurity perspective, it makes sense to have tools that can inspect content and tell whether it is real or fake. Possibly later, such tools could even find fake content and have a bot post real information so people do not get tricked.

The question I have is: can NLP distinguish real from fake news in social media? If it can, is there confidence that ChatGPT could likewise detect real from fake news in social media?

Data

I searched various cybersecurity-related datasets given the links provided. I knew there had to be some data related to social media, and I found a dataset of real and fake news on Twitter! Given that the tweets are short and rich with words, and the files are already split into real and fake so I can label them appropriately, it made a lot of sense to give it a try.

There were four files in this location (link below). It also looked like there were tools to gather new data if desired. The samples seemed good enough for what I needed. The four CSV files were named politifact_fake.csv, politifact_real.csv, gossipcop_fake.csv, and gossipcop_real.csv. I took the data from the four files and merged it into one Pandas dataframe. In total there were 23196 rows and 5 columns. The columns were id, news_url, title, tweet_ids, and label (Figure 1). All the data was textual, and the first thing I did with the labels was convert fake and real into numeric 0 and 1. When sorting the data, I found there was an imbalance where most of the data was real.

https://github.com/KaiDMML/FakeNewsNet/tree/master/dataset

Figure 1 - example data from dataset
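For reference, a rough sketch of how the merge and label conversion can be done in pandas; the local file paths here are an assumption, not the exact notebook code.

import pandas as pd

# Merge the four FakeNewsNet CSV files and convert the fake/real label to 0/1.
files = {
    "politifact_fake.csv": "fake",
    "politifact_real.csv": "real",
    "gossipcop_fake.csv": "fake",
    "gossipcop_real.csv": "real",
}

frames = []
for path, label in files.items():
    df = pd.read_csv(path)
    df["label"] = label          # label derived from the file name
    frames.append(df)

news = pd.concat(frames, ignore_index=True)
news["label"] = news["label"].map({"fake": 0, "real": 1})

print(news.shape)                      # roughly (23196, 5)
print(news["label"].value_counts())    # real (1) far outnumbers fake (0)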

There were 17441 records labeled as real news and 5755 labeled as fake. There were also records with null values in the news_url and tweet_ids columns (Figure 2). I cleaned the data and was left with 16120 real news records and 5287 fake news records.

Figure 2 - Graph of features that have null values

The imbalance was pretty staggering at 75.3% real to 24.7% fake records (Figure 3). I decided to over-sample the fake data to get a 50/50 mix of records (Figure 4). I tried under-sampling initially and found it really dragged down the accuracy later.

Figure 3 - Before dataset balancing
Figure 4 - After dataset balancing

Before moving into modeling, I decided to perform an 80/20 split of the dataset for training and testing respectively.
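Roughly, the balancing and split can look like the following, continuing from the merged dataframe above; the post does not show the exact resampling call, so plain pandas sampling with replacement is assumed here.

import pandas as pd
from sklearn.model_selection import train_test_split

# Over-sample the minority (fake) class with replacement until the classes are even.
real = news[news["label"] == 1]
fake = news[news["label"] == 0]
fake_upsampled = fake.sample(n=len(real), replace=True, random_state=42)

# Combine and shuffle the balanced dataset.
balanced = pd.concat([real, fake_upsampled]).sample(frac=1, random_state=42)

# 80/20 split for training and testing.
train_df, test_df = train_test_split(
    balanced, test_size=0.2, random_state=42, stratify=balanced["label"]
)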

Model

Given I was doing NLP, I knew the data would have to be pre-processed and tokenized. I went online and found information on the bag-of-words technique. The lines of text would have to be stripped of useless characters, then tokenized, and finally lemmatized. Lemmatization reduces words to their base forms, for example turning “removing” into “remove.” These word bags would then be associated with the labels provided earlier.
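A sketch of that preprocessing pipeline, assuming NLTK’s WordNet lemmatizer and scikit-learn’s CountVectorizer; the notebook’s exact libraries and parameters are not shown.

import re

import nltk
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)
lemmatizer = WordNetLemmatizer()

def clean_title(text):
    """Strip non-letters, lowercase, tokenize on whitespace, and lemmatize."""
    text = re.sub(r"[^a-zA-Z]", " ", str(text)).lower()
    tokens = text.split()
    return " ".join(lemmatizer.lemmatize(tok, pos="v") for tok in tokens)

# Bag-of-words: each cleaned title becomes a vector of word counts.
vectorizer = CountVectorizer(stop_words="english", max_features=5000)
X_train = vectorizer.fit_transform(train_df["title"].apply(clean_title))
y_train = train_df["label"].values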

The news data then had to be split again into X (features) and y (labels), and into training versus testing datasets. Afterwards I used a confusion matrix to get a sense of how well the model separated the true and false predictions (Figure 5). This looked acceptable, so it was good to move on.

Figure 5 - News Confusion Matrix
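For completeness, a confusion matrix like Figure 5 can be drawn with scikit-learn as below; the post does not say which model sat behind the figure, so a quick logistic regression baseline is assumed here.

import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import ConfusionMatrixDisplay

# Vectorize the test titles with the vectorizer fitted on the training data.
X_test = vectorizer.transform(test_df["title"].apply(clean_title))
y_test = test_df["label"].values

# Quick baseline fit just to inspect the confusion matrix.
baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
ConfusionMatrixDisplay.from_estimator(
    baseline, X_test, y_test, display_labels=["fake", "real"]
)
plt.show()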

Results

Given past experiences with AutoML and having just learned about PyCaret, I was curious to try it out and see if it could shortcut finding the best classifiers for the news dataset. Looking at Figure 6, the Gradient Boosting Classifier (gbc) and Random Forest Classifier (rf) looked very promising. Looking at the Recall column, I noticed gbc and rf had a weakness where some sort of linear or logistic regression could help with prediction. The AutoML accuracy levels were on the lower side, with the highest at 0.7259 and an AUC of 0.7315. This did not seem bad given the best F1 score was also fairly high at around 0.7844. The F1 score helps evaluate the performance of the individual models by combining both precision and recall.

Figure 6 - PyCaret AutoML Results

Tuning the models to optimize AUC shows that the performance of the different machine learning models for this binary classification averages around 0.7316 (Figure 7). This is not very high, but it is fair, and I had hopes that an ensemble could do better.

Figure 7 - Tune model optimizing to AUC
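A hedged sketch of the PyCaret workflow behind Figures 6 and 7; the setup arguments here (session_id, dense feature handling) are assumptions rather than the notebook’s exact settings.

import pandas as pd
from pycaret.classification import setup, compare_models, tune_model, predict_model

# PyCaret expects a single dataframe of features plus the target column.
# Densifying 5000 bag-of-words columns is heavy but acceptable for a sketch.
features = pd.DataFrame(
    X_train.toarray(), columns=vectorizer.get_feature_names_out()
)
features["label"] = y_train

setup(data=features, target="label", session_id=42)

best = compare_models(sort="AUC")          # the leaderboard shown in Figure 6
tuned = tune_model(best, optimize="AUC")   # hyperparameter tuning toward AUC (Figure 7)
predict_model(tuned)                       # scores on PyCaret's internal hold-out split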

When testing the model against the hold-out test dataset, we find the accuracy and AUC are close to what we were seeing during training (Figure 8).

Figure 8 - Testing the Gradient Boosting Classifier

I then ran gradient boosting alone using the scikit-learn classifier and graphed the ROC curve. The curve looked a bit off, and accuracy was, as expected, around that lower 70% range, but interestingly the ROC-AUC looked better, nearing 80% (Figure 9).

Figure 9 - Gradient Boosting Classifier
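That standalone run can be reproduced along these lines, using default hyperparameters since the notebook’s exact settings are not shown.

import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import RocCurveDisplay, accuracy_score

# Fit gradient boosting on the bag-of-words training data.
gbc = GradientBoostingClassifier(random_state=42)
gbc.fit(X_train, y_train)

print("accuracy:", accuracy_score(y_test, gbc.predict(X_test)))
RocCurveDisplay.from_estimator(gbc, X_test, y_test)  # ROC curve as in Figure 9
plt.show()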

So, I looked into logistic regression and random forest, knowing these were classifiers whose curves might help predictions alongside gradient boosting (Figure 10). Surprisingly, logistic regression accuracy was 0.86 with an ROC-AUC score of 0.93, and random forest accuracy was 0.91 with an ROC-AUC score of 0.97! It almost seems that with just these two a prediction model could be put together without the AutoML favorite, gradient boosting.

Figure 10 - Logistic Regression and Random Forest Classifiers

It made sense to try an ensemble of these two models, and this led to an accuracy of 0.91 with an ROC-AUC of 0.97 (Figure 11). What would happen if we added all three classifiers?

Figure 11 - Ensemble combining Logistic Regression and Random Forest
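The post does not show the exact ensembling method, but a soft-voting ensemble in scikit-learn is one straightforward way to combine the two models.

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

# Soft voting averages the predicted probabilities of the two classifiers.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=200, random_state=42)),
    ],
    voting="soft",
)
ensemble.fit(X_train, y_train)

proba = ensemble.predict_proba(X_test)[:, 1]
print("accuracy:", accuracy_score(y_test, ensemble.predict(X_test)))
print("roc-auc :", roc_auc_score(y_test, proba))

Adding the gradient boosting classifier as a third entry in the estimators list is the one-line change behind the three-model comparison discussed next.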

It turns out not much; it actually hurt the predictions slightly, with accuracy moving from 0.908 to 0.903 and the ROC-AUC score moving from 0.969 to 0.960 (Figure 12).

Figure 12 - Adding Gradient Boosting with other two classifiers

It was interesting how PyCaret’s AutoML seemed to have lower accuracy than models created directly with the various scikit-learn classifiers. I do see that when working with binary classification, as in this scenario, random forest and logistic regression come up quite a bit as winners, as they did here. When creating an ensemble, what can be seen is a smoothing of the curve (Figure 13).

Figure 13 - Side by side comparison of ROC curve when put in ensemble

Yet it was clear that random forest alone was pretty accurate and had the greater ROC-AUC score. If anything, adding gradient boosting or logistic regression made it slightly worse.

Conclusions

From the analysis, it was learned that tokenizing and bag-of-words were great techniques for developing an NLP predictor that can detect real versus fake news from Twitter. Given that this kind of rudimentary predictor was fairly effective, there is a strong sense that ChatGPT could not only generate content but also detect malformed or fabricated content. The curiosity is there, and this would be a suggestion to try in future studies.

Links

Twitter distributed background from CrowdStrike

What is an advanced persistent threat (APT)?

An advanced persistent threat is a “Threat from a highly organized attacker with significant resources that is carried out over a long period of time” (McMillian, 2021). McMillian describes the victims as large corporations or government entities, and the attackers as well-funded groups of highly skilled individuals from nation-states. Most of these attacks are hard to detect but can be detected through logs and performance metrics that capture environmental abnormalities.

What is a Nation-state actor?

“Nation-state or state sponsors are usually foreign governments. They are interested in pilfering data, including intellectual property and research and development data, from major manufacturers, tech companies, government agencies, and defense contractors. They have the most resources and are the best organized of any of the threat actor groups.” (McMillian, 2021).

“The security firm Mandiant tracked several APTs over a period of 7 years, all originating in China, specifically Shanghai and the Pudong region. These APTs were simply named APT1, Apt2, and so on.” (Easttom, 2018).

“The attacks were linked to PLA Unit 61398 of China’s military. The Chinese government regards this unit’s activities as classified, but it appears that offensive cyber warfare is one of its tasks. Just one of the APTs from this group compromised 141 companies in 20 different industries. APT1 was able to maintain access to victim networks for an average of 365 days, and in one case for 1,764 days. APT1 is responsible for stealing 6.5 terabytes of information from a single organization over a 10-month timeframe.” (Easttom, 2018).

CrowdStrike 2022 Global Threat Report

CrowdStrike’s annual 2022 Global Threat Report describes the naming conventions used to categorize adversaries according to nation-state affiliations. These are the codenames for the various adversary actors studied by CrowdStrike when analyzing the various tactics, techniques, and procedures (TTP) case studies.

Mandiant M-Trends 2022 Report

The annual Mandiant M-Trends 2022 Report highlights the techniques most frequently used in 2021 with regard to MITRE ATT&CK. The 10 most frequently seen techniques are listed in the report and tied to the various MITRE ATT&CK framework identifiers.

APT detection frameworks

MITRE ATT&CK

MITRE ATT&CK is a knowledge base of adversary tactics and techniques based on real-world observations. The knowledge base is used as a foundation for the development of specific threat models and methodologies. These threat models are created by the private sector, government, and the cybersecurity product and service community. The ATT&CK framework is open to the community for use at no charge to develop and mature our ability to detect and defend against common adversary tactics and techniques.

MITRE Engage (formerly MITRE Shield)

MITRE Engage, formerly the MITRE Shield framework, leverages MITRE ATT&CK. The framework is used to plan and discuss adversary engagement operations, showing you how to engage adversaries to best achieve cybersecurity goals. There are several tools and guides focusing on the matrix, playbook process, community, standards, and mindset. The framework provides a starter kit which leads you down a path covering basics, language, methodologies, adversary engagement, and joining the community.

Lockheed Martin Cyber Kill Chain

The Cyber Kill Chain was developed by Lockheed Martin and is part of its Intelligence Driven Defense model. The primary purpose of the chain is to determine how far down a path an intrusion has progressed and how to terminate the intrusion before it gets to the end of the chain of events. The model identifies what adversaries must do to achieve their ultimate goal. The framework has seven steps to enhance visibility into an attack and enrich an analyst’s understanding of an adversary’s tactics, techniques, and procedures. These seven steps from Lockheed Martin are listed below.

  1. Reconnaissance – Harvesting email addresses, conference information, etc.
  2. Weaponization – Coupling exploit with backdoor into deliverable payload.
  3. Delivery – Delivering weaponized bundle to the victim via email, web, USB, etc.
  4. Exploitation – Exploiting a vulnerability to execute code on victim’s system.
  5. Installation – Installing malware on the asset.
  6. Command & Control (C2) – Command channel for remote manipulation of the victim.
  7. Actions on objectives – With ‘hands on keyboard’ access, intruders accomplish their original goals.

Diamond Model (Caltagirone et al, 2013)

“The model describes that an adversary deploys a capability over some infrastructure against a victim. These activities are called events and are the atomic features. Analysts or machines populate the model’s vertices as events are discovered and detected. The vertices are linked with edges highlighting the natural relationship between the features. By pivoting across edges and within vertices, analysts expose more information about adversary operations and discover new capabilities, infrastructure, and victims. The interactions about the diamond are defined by the axioms surrounding the various events occurring about the diamond.” (Caltagirone et al, 2013). These axioms and interactions focus on the diamond event, adversaries, victims, phases, resources, and social-political factors with relationship to persistent adversary relationships. The seven main axioms from the paper are listed below, along with how they relate to the activity threads adversaries would follow along a kill chain path.

  • Axiom 1 – For every intrusion event there exists an adversary taking a step towards an intended goal by using a capability over infrastructure against a victim to produce a result.
  • Axiom 2 – There exists a set of adversaries (insiders, outsiders, individuals, groups, and organizations) which seek to compromise computer systems or networks to further their intent and satisfy their needs.
  • Axiom 3 – Every system, and by extension every victim asset, has vulnerabilities and exposures.
  • Axiom 4 – Every malicious activity contains two or more phases which must be successfully executed in succession to achieve the desired result.
  • Axiom 5 – Every intrusion event requires one or more external resources to be satisfied prior to success.
  • Axiom 6 – A relationship always exists between the Adversary and their Victim(s) even if distant, fleeting, or indirect.
  • Axiom 7 – There exists a sub-set of the set of adversaries which have the motivation, resources, and capabilities to sustain malicious effects for a significant length of time against one or more victims while resisting mitigation efforts. Adversary-Victim relationships in this sub-set are called persistent adversary relationships.

MITRE Caldera

Caldera is a framework that automates adversary simulations. Security teams can build adversary profiles and launch them in the network to see where there are weaknesses. This helps test defenses and people’s ability to detect specific threats. The framework consists of the core system and plugins. The core system is the framework code which includes an asynchronous command-and-control (C2) server. The plugins expand the framework to provide agents, reports, and collections of TTPs. The GitHub repository for MITRE Caldera is below and can be leveraged by Red Team (attack) efforts to build a stronger Blue Team (defend).

GitHub for MITRE Caldera

References

Golden Chopsticks Kimchi SLAC Design

(1) The kimchimenow.com web servers have the Splunk Universal Forwarder installed to capture log file data. This is done on load-balanced servers deployed across three availability zones. Below are the most critical Linux logs to monitor.

  • /var/log/syslog or /var/log/messages - stores all activity data across the Linux system.
  • /var/log/auth.log or /var/log/secure - stores authentication logs
  • /var/log/boot.log - messages logged during startup
  • /var/log/maillog or /var/log/mail.log - events related to email servers
  • /var/log/kern - kernel logs
  • /var/log/dmesg - device driver logs
  • /var/log/faillog - failed login attempts
  • /var/log/cron - events related to cron jobs or the cron daemon
  • /var/log/yum.log - events related to installation of yum packages
  • /var/log/httpd/ - HTTP errors and access logs containing all HTTP requests
  • /var/log/mysqld.log or /var/log/mysql.log - MySQL log files

(2) The Splunk Enterprise Server is where all the various logs from the kimchimenow.com web application servers are forwarded. This server is where event data is aggregated: data related to system, network, operating system, database, application, web server, and user events.

(3) Machine learning techniques are used to correlate and identify associations between event data. Some of the common event correlation techniques are time-, rule-, pattern-, topology-, domain-, and history-based.
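As a small illustration of the time-based technique, here is a hedged Python sketch that buckets exported events into per-host time windows; the field names and sample values are illustrative assumptions, not part of the design itself.

import pandas as pd

# Toy events standing in for data exported from the Splunk server.
events = pd.DataFrame({
    "_time": pd.to_datetime([
        "2023-01-01 10:00:01", "2023-01-01 10:00:03",
        "2023-01-01 10:00:04", "2023-01-01 12:30:00",
    ]),
    "host": ["web-1", "web-1", "web-1", "web-2"],
    "source": [
        "/var/log/secure", "/var/log/secure",
        "/var/log/httpd/access_log", "/var/log/secure",
    ],
})

# Time-based correlation: bucket events into 5-minute windows per host so that
# related events (e.g., repeated auth failures followed by web activity) land
# in the same group for a rule or model to score.
events["window"] = events["_time"].dt.floor("5min")
correlated = events.groupby(["host", "window"])["source"].apply(list)
print(correlated)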

(4) The correlated data can be normalized and integrated into a security information and event management (SIEM) system. SIEM dashboards can be created to:

  • Provide an overview of notable events in your environment that represent potential security incidents.
  • Show details of all notable events identified in your environment, so you can undertake triage.
  • Provide a workbook of all open investigations, allowing you to track your progress and activity while investigating multiple security incidents.
  • Perform risk analysis that lets you score systems and users across your network to identify risks.
  • Display threat intelligence that is designed to add context to your security incidents and identify known malicious actors in your environment.
  • Show protocol intelligence using captured packet data to provide network insights that are relevant to your security investigations, allowing you to identify suspicious traffic, DNS activity, and email activity.
  • Show user intelligence that lets you investigate and monitor the activity of users and assets in your environment.
  • Show web intelligence to analyze web traffic in your network.

SIEM provides real-time visibility, enhances investigations, and can fast-track threat response. The MITRE ATT&CK framework can be leveraged to determine frequent attack vectors and vulnerabilities in an IT ecosystem.

References

Splunk: What is IT Event Correlation?

Splunk: What Is Security Information and Event Management (SIEM)?

Exabeam: SIEM Logging: Security Log Aggregation, Processing and Analysis

Splunk: Splunk Security Essentials

Install and set up Splunk Enterprise Server and Splunk Universal Forwarder on an AWS EC2 instance.

Link to my notes on Notion

Video on YouTube


Index

  • Create EC2 Instance
  • Splunk Enterprise Server
  • Splunk Universal Forwarder

Create EC2 Instance

  • Launch creation of an EC2 instance - Launch Instance
  • Set Name as splunk and leave the rest of the defaults
  • Set Key pair to one you created, or Create a new key pair (RSA and .pem format), then click the Launch Instance button to create the EC2 instance
  • Navigate to the new instance and grab the public IPv4 address
  • SSH into the instance
  • Perform an update
sudo yum update

Splunk Enterprise

  • Navigate to Linux and click the Download button for .rpm
  • Cancel the download prompt and get the wget command to download Splunk
  • Execute the wget in the terminal to download the install file, and then install the splunk .rpm file.
wget -O splunk-9.0.2-17e00c557dc1-linux-2.6-x86_64.rpm "https://download.splunk.com/products/splunk/releases/9.0.2/linux/splunk-9.0.2-17e00c557dc1-linux-2.6-x86_64.rpm"

sudo yum install ./splunk-9.0.2-17e00c557dc1-linux-2.6-x86_64.rpm
  • Start the Splunk server
sudo bash

cd /opt/splunk/bin

./splunk start --accept-license --answer-yes
  • Enter an administrator username and password; remember these because you will need them to log into the application
  • In AWS navigate to the EC2 instance Security groups
  • Edit inbound rules
  • Add rule to open port 8000 and Save rules
  • Under Messages you will see a warning message; this will need to be fixed for Splunk to work
  • Navigate to Settings > Server settings and then General settings
  • Under Index settings, change Pause indexing if free disk space falls below from 5000 to 50 and Save

Splunk Universal Forwarder

  • Select Linux, then in the 64-bit section click the Download button for .rpm
  • Cancel the download popup window and then copy the wget command
  • Open the terminal, exit from the root user and go back to the home directory
  • Execute the wget and after install the forwarder
wget -O splunkforwarder-9.0.2-17e00c557dc1-linux-2.6-x86_64.rpm "https://download.splunk.com/products/universalforwarder/releases/9.0.2/linux/splunkforwarder-9.0.2-17e00c557dc1-linux-2.6-x86_64.rpm"

sudo yum install ./splunkforwarder-9.0.2-17e00c557dc1-linux-2.6-x86_64.rpm
  • Change to the splunkforwarder bin directory and start the forwarder
sudo bash

cd /opt/splunkforwarder/bin

./splunk start --accept-license --answer-yes
  • Enter username and password
  • Set the port for the forwarder to 9089; this keeps the Splunk server from conflicting with the Splunk forwarder
  • Set the forwarder to forward to the Splunk server on port 9997; you will need to enter the username and password
./splunk add forward-server 3.137.207.15:9997
  • Set the forwarder to monitor the /var/log directory and restart
./splunk add monitor /var/log

./splunk restart
  • Set the Splunk server to listen on port 9997 and restart
cd /opt/splunk/bin

./splunk enable listen 9997

./splunk restart
  • In AWS navigate to Security groups and again Edit Inbound rules
  • Add port 9997 and Save rules
  • Log into splunk again
  • Go to the splunk home
  • Go to Search & Reporting
  • Select the Data Summary button
  • Under the Hosts tab there should be the server and the forwarder; select the first IP location link
  • There should be logs from the /var/log location in the list

I'm a student working on my doctoral degree at GWU for Cybersecurity Analytics.

While working on an assignment, one professor had an example blog that was used to publish homework content. I looked into this and found that GWU also has a personal blog system that can be leveraged.

I take notes in Notion and share some of them with peers. Yet there are some notes I believe are worth sharing more widely to help with how to do something, like setting up VMs with Kali Linux and Metasploitable, or how to set up WSL 2 locally to use Linux on a Windows machine to connect to an AWS instance you created.

Anyways... let's see how this goes... Be Lucky.

John Kuk - john.kuk@gwu.edu