Goal of Hackathon: Analyze a dataset of 700 social media accounts to identify signs of foreign interference operations in Canada and present the findings to a panel of government officials.

khushil-sketch/Canadian-Information-Security-Datathon

The Goal

In this Datathon, we were given an open-ended task: search for signs of covert foreign interference in Canada orchestrated through state-sponsored social media accounts. After analyzing the data, we turned our findings into clear, actionable insights through visualizations and data-driven narratives. Our aim was to give government officials a solid understanding of the impact of foreign interference, to help them make informed decisions on this issue.

We were provided with a dataset of state-affiliated social media accounts, which served as our starting point. After some exploratory data analysis, I found that Twitter was the most popular platform, so we decided to focus our analysis on it.

Important

This repo contains datasets, code files and the data-driven presentation for the 72-hour Canadian Network & Information Security (CANIS) Datathon.

Check out our Jupyter Notebook, which contains most of the analysis and visualizations that I did: https://colab.research.google.com/drive/1emBGbJVeE6vVSzEzPkSXX5NvRF9azT6W?usp=sharing

Check out our Presentation: https://prezi.com/view/Y6Qm9wQupHFbJIEw8BA7/

Collaborators: Khushil Nagda, Adel Müürsepp, Jeffrey Zhou

Step 1: Scraping the data

We scraped 6000 tweets from the Twitter accounts in the dataset. The original dataset was missing crucial Twitter metadata such as likes, comments and post content, and this information would be very useful for our goal.

A glimpse of the 6000-tweet dataset that we scraped. We extracted information such as the engagement of each post and its like count.

image

Step 2: Sentiment Analysis

With this huge dataset, there were multiple angles from which we could approach our foreign interference investigation. However, one crucial component was missing, a component that would help us identify specific accounts that were attempting to influence opinions: A Sentiment Analysis of Tweets.

Using NLTK's VADER sentiment analyzer, we analyzed the sentiment and subjectivity of the tweets, assigning metrics that quantify how negative or positive each tweet is.
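To make the scoring step concrete, here is a minimal sketch of how a VADER-style compound score (a value between -1 and 1) can be turned into a coarse label. The ±0.05 cutoffs are the commonly cited VADER convention; whether our notebook used exactly these cutoffs is an assumption, and the scores below are made-up illustrative values, not data from the 6000 tweets.

```python
def label_sentiment(compound: float, threshold: float = 0.05) -> str:
    """Map a compound score in [-1, 1] to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Hypothetical compound scores, not taken from the dataset.
scores = [0.62, -0.41, 0.0, 0.03, -0.05]
labels = [label_sentiment(s) for s in scores]
print(labels)
```

Keeping the raw compound score alongside the label is useful later, because the visualizations rely on how extreme a score is, not only on its sign.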

Now that the data collection was done, my work began. Using Pandas, Matplotlib and Plotly, I cleaned and transformed the datasets, investigated anomalies, performed exploratory visualizations and uncovered insights that I brought to light in the final visualizations.

Step 3: Visualizations

Who owns the accounts? What regions do they target?

Network Flow Diagrams

image

image

Investigating Likes

One way to catch accounts masquerading as authentic is to compare their likes with their followers. Suspicious accounts have many followers but few likes, or few followers but many likes.
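One possible way to express the likes-versus-followers check in code is sketched below. It flags accounts whose likes-per-follower ratio (on a log scale) falls outside the usual 1.5 × IQR fences. The IQR rule, the field names (`name`, `followers`, `avg_likes`) and the sample accounts are all assumptions made for illustration; this is not necessarily the rule used to produce the scatterplot.

```python
import math
import statistics


def flag_ratio_outliers(accounts, k=1.5):
    """Return sorted names whose log10(avg_likes / followers) is outside the IQR fences."""
    ratios = {
        a["name"]: math.log10(a["avg_likes"] / a["followers"])
        for a in accounts
        if a["avg_likes"] > 0 and a["followers"] > 0  # log needs positive values
    }
    q1, _, q3 = statistics.quantiles(ratios.values(), n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return sorted(name for name, r in ratios.items() if r < low or r > high)


# Hypothetical accounts, not taken from the dataset.
accounts = [
    {"name": "acct_a", "followers": 1_000, "avg_likes": 10},
    {"name": "acct_b", "followers": 2_000, "avg_likes": 30},
    {"name": "acct_c", "followers": 5_000, "avg_likes": 40},
    {"name": "acct_d", "followers": 10_000, "avg_likes": 120},
    {"name": "acct_e", "followers": 50_000, "avg_likes": 400},
    {"name": "acct_f", "followers": 800, "avg_likes": 9},
    {"name": "acct_g", "followers": 100_000, "avg_likes": 5},  # many followers, few likes
    {"name": "acct_h", "followers": 300, "avg_likes": 900},    # few followers, many likes
]
flagged = flag_ratio_outliers(accounts)
print(flagged)
```

Working on the log of the ratio keeps both kinds of suspicious account (too few likes and too many likes) roughly symmetric around the typical value.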

image

Out of all the Twitter accounts represented above, 29 were anomalous, but are all of them worth investigating?

image

Of the 29 anomalies, only 4 are significant enough to investigate. What makes them significant is that their average like counts were calculated from a large sample of tweets, unlike the other potential anomalies, which had a low tweet count.
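The sample-size filter described here can be sketched as follows. The `min_tweets=30` threshold, the function name and the sample data are hypothetical; the write-up does not state the exact tweet count that separated the 4 significant anomalies from the rest.

```python
from statistics import mean


def reliable_anomalies(likes_by_account, anomalous, min_tweets=30):
    """Keep anomalous accounts whose average is based on at least min_tweets tweets."""
    kept = {}
    for name in anomalous:
        likes = likes_by_account.get(name, [])
        if len(likes) >= min_tweets:
            kept[name] = mean(likes)
    return kept


# Hypothetical per-tweet like counts, not taken from the dataset.
likes_by_account = {
    "acct_a": [5] * 40,       # 40 tweets: the average is reasonably stable
    "acct_b": [900, 1_100],   # 2 tweets: the average could be a fluke
}
significant = reliable_anomalies(likes_by_account, ["acct_a", "acct_b"])
print(significant)
```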

image

image

Investigating the Outliers

Of the 4 anomalies, 2 are run by individual people. Let's look at the metadata of one such account, Serena Dong's.

image image

image

For an account such as Serena's, with 47800 followers, the distribution of Tweet views, likes and retweets seems normal and within the realm of possibility. However, 3 things need to be done:

  1. A benchmark has to be created of what constitutes a normal distribution of Tweet views, likes and retweets for accounts with a similar level of following.
  2. Further analysis needs to be done of the anomalies (popular posts) in Serena's account.
  3. Investigating Twitter views may reveal more information.
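The first item, a benchmark for accounts with a similar following, could start from something like the sketch below: collect accounts whose follower count lies within a band around the target and report the quartiles of a chosen metric. The ±50% band, the field names and the peer accounts are assumptions for illustration only; we did not build this benchmark during the Datathon.

```python
from statistics import quantiles


def peer_benchmark(accounts, followers, metric, band=0.5):
    """Quartiles of `metric` among accounts within +/- band (fractional) of `followers`."""
    lo, hi = followers * (1 - band), followers * (1 + band)
    peers = [a[metric] for a in accounts if lo <= a["followers"] <= hi]
    if len(peers) < 2:
        raise ValueError("not enough peer accounts for a benchmark")
    q1, median, q3 = quantiles(peers, n=4)
    return {"n_peers": len(peers), "q1": q1, "median": median, "q3": q3}


# Hypothetical peer accounts, not taken from the dataset.
accounts = [
    {"followers": 30_000, "avg_likes": 100},
    {"followers": 40_000, "avg_likes": 200},
    {"followers": 50_000, "avg_likes": 300},
    {"followers": 60_000, "avg_likes": 400},
    {"followers": 70_000, "avg_likes": 500},
    {"followers": 5_000, "avg_likes": 50},       # outside the band
    {"followers": 200_000, "avg_likes": 5_000},  # outside the band
]
benchmark = peer_benchmark(accounts, followers=47_800, metric="avg_likes")
print(benchmark)
```

An account's own average can then be compared against `q1` and `q3` to judge whether it sits inside the typical range for its size.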

Investigating Twitter Views

image

image

There are a lot of anomalous accounts revealed by the scatterplots. This raises the question: why do some accounts have disproportionately more Tweet views than followers? The sentiment analysis can provide some clues.

What regions have the most positive and negative Tweet sentiment?

image

What is the distribution of Sentiment Scores for the 6000 tweets?

image

Do Polarising tweets get Seen More?

A notable observation is that even though fewer tweets have a negative sentiment score, a large number of them get a high number of views. This suggests that the accounts are quite successful at dividing opinion, i.e. influencing the masses.
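A simple numeric companion to this observation is to compare how many tweets fall into each sentiment bucket with the median views per bucket. The sketch below assumes each tweet carries a `compound` score and a `views` count (hypothetical field names) and uses made-up values; it illustrates the comparison, not our actual results.

```python
from collections import defaultdict
from statistics import median


def views_by_sentiment(tweets, threshold=0.05):
    """Return {label: (tweet_count, median_views)} for each sentiment bucket."""
    buckets = defaultdict(list)
    for t in tweets:
        if t["compound"] >= threshold:
            label = "positive"
        elif t["compound"] <= -threshold:
            label = "negative"
        else:
            label = "neutral"
        buckets[label].append(t["views"])
    return {label: (len(v), median(v)) for label, v in buckets.items()}


# Hypothetical tweets, not taken from the dataset.
tweets = [
    {"compound": 0.6, "views": 200},
    {"compound": 0.4, "views": 300},
    {"compound": 0.7, "views": 250},
    {"compound": 0.0, "views": 100},
    {"compound": -0.8, "views": 5_000},
    {"compound": -0.5, "views": 3_000},
]
summary = views_by_sentiment(tweets)
print(summary)
```

The median is used instead of the mean so that a single viral tweet does not dominate a bucket.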

image

Isolating the anomalies: Highly Viewed Tweets that have extreme sentiment scores

image
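The isolation step can be sketched as a two-condition filter: keep tweets whose sentiment score is extreme and whose views are in the top tail of the distribution. The cutoffs (absolute compound score of at least 0.5, views at or above the 90th percentile), the field names and the sample tweets are assumptions chosen for illustration, not the thresholds behind the chart above.

```python
from statistics import quantiles


def extreme_and_viral(tweets, sentiment_cutoff=0.5, view_percentile=90):
    """Tweets with |compound| >= sentiment_cutoff and views >= the given percentile."""
    views = [t["views"] for t in tweets]
    # quantiles(..., n=100) returns the 99 percentile cut points.
    view_cut = quantiles(views, n=100)[view_percentile - 1]
    return [
        t for t in tweets
        if abs(t["compound"]) >= sentiment_cutoff and t["views"] >= view_cut
    ]


# Hypothetical tweets: views 100, 200, ..., 2000 with mild sentiment by default.
tweets = [{"id": i, "views": 100 * i, "compound": 0.1} for i in range(1, 21)]
tweets[0]["compound"] = 0.95   # extreme sentiment but few views
tweets[19]["compound"] = -0.9  # extreme sentiment and many views

hits = extreme_and_viral(tweets)
print([t["id"] for t in hits])
```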

Our Presentation

Lastly, we presented the results of our analysis in a manner that was both clear and concise, ensuring comprehensibility for a diverse audience. This included making our work accessible even to those who don't have domain expertise in statistics.
