In this Datathon, we were tasked with the open-ended goal of searching for signs of covert Foreign Interference in Canada orchestrated via state-sponsored social media accounts. After analyzing the data, we transformed our findings into clear, actionable insights through engaging visualizations and compelling data-driven narratives. We aimed to give government officials a solid understanding of the impact of Foreign Interference, helping them make informed decisions on this pivotal issue.
We were provided with a dataset of State-Affiliated Social Media accounts, which served as our starting point. After some exploratory data analysis, I found that Twitter was the most heavily represented platform, so we decided to focus our analysis on it.
This repo contains datasets, code files and the data-driven presentation for the 72-hour Canadian Network & Information Security (CANIS) Datathon.
Check out our Jupyter Notebook, which contains most of the analysis and visualizations that I did: https://colab.research.google.com/drive/1emBGbJVeE6vVSzEzPkSXX5NvRF9azT6W?usp=sharing
Check out our Presentation: https://prezi.com/view/Y6Qm9wQupHFbJIEw8BA7/
Collaborators: Khushil Nagda, Adel Müürsepp, Jeffrey Zhou
We scraped 6,000 tweets from the Twitter accounts in the dataset, since the original dataset was missing crucial Twitter metadata such as likes, comments, and post content that would be incredibly useful for our goal.
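The scraping code itself lives in the notebook; below is a minimal sketch of how such a dataset could be collected, assuming snscrape as the scraping tool (the notebook may use a different one) and a list of handles taken from the provided accounts dataset. Attribute names vary between snscrape versions, hence the `getattr` fallbacks.

```python
# Sketch only: assumes snscrape and a `handles` list from the CANIS dataset.
import pandas as pd
import snscrape.modules.twitter as sntwitter

handles = ["example_handle"]  # placeholder; real handles come from the accounts dataset
rows = []
for handle in handles:
    for i, tweet in enumerate(sntwitter.TwitterUserScraper(handle).get_items()):
        if i >= 200:  # cap per account to keep the total around 6,000 tweets
            break
        rows.append({
            "account": handle,
            "date": tweet.date,
            "content": getattr(tweet, "rawContent", None) or getattr(tweet, "content", None),
            "likes": tweet.likeCount,
            "replies": tweet.replyCount,
            "retweets": tweet.retweetCount,
            "views": getattr(tweet, "viewCount", None),
        })

tweets_df = pd.DataFrame(rows)
tweets_df.to_csv("tweets.csv", index=False)
```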
A glimpse of the 6,000-tweet dataset that we scraped: we extracted crucial information such as each post's engagement and like count.
With this huge dataset, there were multiple angles from which we could approach our foreign interference investigation. However, one crucial component was still missing, one that would help us identify specific accounts attempting to influence opinions: a sentiment analysis of the tweets.
Leveraging NLTK's VADER library, we analyzed the sentiment and subjectivity of the tweets, assigning each tweet scores that quantify how negative or positive it is.
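A short sketch of this scoring step is below, assuming a `content` column in the scraped dataframe. VADER supplies the polarity score; the subjectivity score here comes from TextBlob, which is an assumption about how that particular metric was computed.

```python
# Sentiment scoring sketch; `tweets_df` is the scraped dataset from above.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from textblob import TextBlob

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

tweets_df["sentiment"] = tweets_df["content"].fillna("").apply(
    lambda t: sia.polarity_scores(t)["compound"]  # -1 (very negative) .. +1 (very positive)
)
tweets_df["subjectivity"] = tweets_df["content"].fillna("").apply(
    lambda t: TextBlob(t).sentiment.subjectivity  # 0 (objective) .. 1 (subjective)
)
```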
Now that data collection was done, my work began. Using Pandas, Matplotlib and Plotly, I cleaned and transformed the datasets, investigated anomalies, performed exploratory visualizations, and surfaced hidden insights in the final visualizations that I crafted.
One way to catch accounts masquerading as authentic is to analyse their likes versus their followers. Suspicious accounts have many followers but few likes, or few followers but many likes.
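One way this check could be expressed in code is sketched below. The per-account aggregation, the `followers_df` lookup table, and the log-ratio z-score rule are all assumptions for illustration; the notebook itself flags anomalies from the scatterplot.

```python
# Flag accounts whose likes are out of proportion to their followers (sketch).
import numpy as np

accounts = tweets_df.groupby("account").agg(
    avg_likes=("likes", "mean"),
    tweet_count=("likes", "size"),
).join(followers_df.set_index("account"))  # followers_df is assumed to hold follower counts

ratio = np.log1p(accounts["avg_likes"]) - np.log1p(accounts["followers"])
z = (ratio - ratio.mean()) / ratio.std()
accounts["anomalous"] = z.abs() > 2  # unusually high or low likes for the follower count
print(accounts[accounts["anomalous"]].sort_values("avg_likes", ascending=False))
```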
Out of all the Twitter accounts represented above, 29 were anomalous, but are all of them worth investigating?
Of the 29 anomalies, only 4 are significant enough to investigate. What makes them significant is that their average like counts were calculated from a large sample of tweets, unlike the other potential anomalies, which had low tweet counts.
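In code, this narrowing-down amounts to a simple filter on the anomaly table from the previous sketch; the threshold of 100 tweets is an assumed cut-off for illustration.

```python
# Keep only anomalies whose average like count rests on a reasonably large tweet sample.
MIN_TWEETS = 100  # assumed threshold
significant = accounts[accounts["anomalous"] & (accounts["tweet_count"] >= MIN_TWEETS)]
```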
Of the 4 anomalies, 2 are run by individuals. Let's look at the metadata of one such account, Serena Dong's.
For an account such as Serena's, with 47,800 followers, the distribution of her tweet views, likes and retweets seems fairly normal and within the realm of possibility. However, 3 things still need to be done:
- A benchmark has to be created of what constitutes a normal distribution of tweet views, likes and retweets for accounts with a similar level of following (a rough sketch follows this list)
- Further analysis needs to be done of the anomalies (popular posts) in Serena's account
- Perhaps investigating tweet views will reveal more information
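For the benchmark in the first bullet, one rough approach is to bin accounts by follower count and record the typical range of likes per tweet within each bin. The bin edges and column names below are assumptions, reusing the `accounts` table from the earlier sketch.

```python
# Benchmark sketch: typical engagement per follower-count bin.
import pandas as pd

bins = [0, 1_000, 10_000, 50_000, 100_000, float("inf")]
accounts["follower_bin"] = pd.cut(accounts["followers"], bins=bins)

benchmark = (
    accounts.groupby("follower_bin", observed=True)["avg_likes"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
print(benchmark)  # an account like Serena's (~47,800 followers) is compared against its bin
```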
The scatterplots reveal a lot of anomalous accounts. This raises the question: why do some accounts have disproportionately more tweet views than followers? The sentiment analysis can provide some clues.
A notable observation is that even though fewer tweets have a negative sentiment score, a large share of them attract high view counts. This suggests that the accounts are quite successful at dividing opinion, i.e. influencing the masses.
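The comparison behind this observation can be reproduced roughly as below: bucket tweets by their VADER compound score and compare view counts across buckets. The ±0.05 cut-offs follow the common VADER convention; column names are as assumed in the earlier sketches.

```python
# Compare view counts across sentiment buckets (sketch).
import pandas as pd

tweets_df["sentiment_bucket"] = pd.cut(
    tweets_df["sentiment"],
    bins=[-1, -0.05, 0.05, 1],
    labels=["negative", "neutral", "positive"],
)

summary = tweets_df.groupby("sentiment_bucket", observed=True)["views"].agg(["count", "median", "mean"])
print(summary)  # fewer negative tweets, but their view counts can still run high
```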
Lastly, we presented the results of our analysis in a manner that was both clear and concise, ensuring comprehensibility for a diverse audience. This included making our work accessible even to those who don't have domain expertise in statistics.