In this Datathon, we were tasked with the open-ended goal of searching for signs of covert Foreign Interference in Canada orchestrated via state-sponsored social media accounts. After analyzing the data, we transformed our findings into clear, actionable insights through engaging visualizations and compelling data-driven narratives. We aimed to give government officials a solid understanding of the impact of Foreign Interference, helping them make informed decisions on this pivotal issue.
We were provided with a dataset of State-Affiliated Social Media accounts, which served as our starting point. After some exploratory data analysis, I found that Twitter was the most heavily represented platform, so we decided to focus our analysis on it.
This repo contains datasets, code files and the data-driven presentation for the 72-hour Canadian Network & Information Security (CANIS) Datathon.
Check out our Jupyter Notebook, which contains most of the analysis and visualizations that I did: https://colab.research.google.com/drive/1emBGbJVeE6vVSzEzPkSXX5NvRF9azT6W?usp=sharing
Check out our Presentation: https://prezi.com/view/Y6Qm9wQupHFbJIEw8BA7/
Collaborators: Khushil Nagda, Adel Müürsepp, Jeffrey Zhou
We scraped 6,000 tweets from the Twitter accounts in the dataset, since the original dataset was missing crucial Twitter metadata such as likes, comments, and post content that would be incredibly useful for our goal.
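The scraping code itself lives in the notebook; below is a minimal sketch of how such a dataset could be collected, assuming snscrape as the scraping tool (the notebook may use a different one) and a list of handles taken from the provided accounts dataset. Attribute names vary between snscrape versions, hence the `getattr` fallbacks.

```python
# Sketch only: assumes snscrape and a `handles` list from the CANIS dataset.
import pandas as pd
import snscrape.modules.twitter as sntwitter

handles = ["example_handle"]  # placeholder; real handles come from the accounts dataset
rows = []
for handle in handles:
    for i, tweet in enumerate(sntwitter.TwitterUserScraper(handle).get_items()):
        if i >= 200:  # cap per account to keep the total around 6,000 tweets
            break
        rows.append({
            "account": handle,
            "date": tweet.date,
            "content": getattr(tweet, "rawContent", None) or getattr(tweet, "content", None),
            "likes": tweet.likeCount,
            "replies": tweet.replyCount,
            "retweets": tweet.retweetCount,
            "views": getattr(tweet, "viewCount", None),
        })

tweets_df = pd.DataFrame(rows)
tweets_df.to_csv("tweets.csv", index=False)
```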
A glimpse of the 6,000-tweet dataset that we scraped: we extracted crucial information such as each post's engagement and like count.
With this huge dataset, there were multiple angles from which we could approach our foreign interference investigation. However, one crucial component was still missing, one that would help us identify specific accounts attempting to influence opinions: a sentiment analysis of the tweets.
Leveraging NLTK's VADER library, we analyzed the sentiment and subjectivity of the tweets, assigning each tweet scores that quantify how negative or positive it is.
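A short sketch of this scoring step is below, assuming a `content` column in the scraped dataframe. VADER supplies the polarity score; the subjectivity score here comes from TextBlob, which is an assumption about how that particular metric was computed.

```python
# Sentiment scoring sketch; `tweets_df` is the scraped dataset from above.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from textblob import TextBlob

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

tweets_df["sentiment"] = tweets_df["content"].fillna("").apply(
    lambda t: sia.polarity_scores(t)["compound"]  # -1 (very negative) .. +1 (very positive)
)
tweets_df["subjectivity"] = tweets_df["content"].fillna("").apply(
    lambda t: TextBlob(t).sentiment.subjectivity  # 0 (objective) .. 1 (subjective)
)
```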
Now that data collection was done, my work began. Using Pandas, Matplotlib and Plotly, I cleaned and transformed the datasets, investigated anomalies, performed exploratory visualizations, and surfaced hidden insights in the final visualizations that I crafted.
One way to catch accounts masquerading as authentic is to analyse their likes versus their followers. Suspicious accounts have many followers but few likes, or few followers but many likes.
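One way this check could be expressed in code is sketched below. The per-account aggregation, the `followers_df` lookup table, and the log-ratio z-score rule are all assumptions for illustration; the notebook itself flags anomalies from the scatterplot.

```python
# Flag accounts whose likes are out of proportion to their followers (sketch).
import numpy as np

accounts = tweets_df.groupby("account").agg(
    avg_likes=("likes", "mean"),
    tweet_count=("likes", "size"),
).join(followers_df.set_index("account"))  # followers_df is assumed to hold follower counts

ratio = np.log1p(accounts["avg_likes"]) - np.log1p(accounts["followers"])
z = (ratio - ratio.mean()) / ratio.std()
accounts["anomalous"] = z.abs() > 2  # unusually high or low likes for the follower count
print(accounts[accounts["anomalous"]].sort_values("avg_likes", ascending=False))
```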
Out of all the Twitter accounts represented above, 29 were anomalous, but are all of them worth investigating?
Of the 29 anomalies, only 4 are significant enough to investigate. What makes them significant is that their average like counts were calculated from a large sample of tweets, unlike the other potential anomalies, which had low tweet counts.
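In code, this narrowing-down amounts to a simple filter on the anomaly table from the previous sketch; the threshold of 100 tweets is an assumed cut-off for illustration.

```python
# Keep only anomalies whose average like count rests on a reasonably large tweet sample.
MIN_TWEETS = 100  # assumed threshold
significant = accounts[accounts["anomalous"] & (accounts["tweet_count"] >= MIN_TWEETS)]
```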
Of the 4 anomalies, 2 are run by individuals. Let's look at the metadata of one such account, Serena Dong's.
For an account such as Serena's, with 47,800 followers, the distribution of her tweet views, likes and retweets seems fairly normal and within the realm of possibility. However, 3 things still need to be done:
- A benchmark has to be created of what constitutes a normal distribution of tweet views, likes and retweets for accounts with a similar level of following (a rough sketch follows this list)
- Further analysis needs to be done of the anomalies (popular posts) in Serena's account
- Perhaps investigating tweet views will reveal more information
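For the benchmark in the first bullet, one rough approach is to bin accounts by follower count and record the typical range of likes per tweet within each bin. The bin edges and column names below are assumptions, reusing the `accounts` table from the earlier sketch.

```python
# Benchmark sketch: typical engagement per follower-count bin.
import pandas as pd

bins = [0, 1_000, 10_000, 50_000, 100_000, float("inf")]
accounts["follower_bin"] = pd.cut(accounts["followers"], bins=bins)

benchmark = (
    accounts.groupby("follower_bin", observed=True)["avg_likes"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
print(benchmark)  # an account like Serena's (~47,800 followers) is compared against its bin
```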
The scatterplots reveal a lot of anomalous accounts. This raises the question: why do some accounts have disproportionately more tweet views than followers? The sentiment analysis can provide some clues.
A notable observation is that even though fewer tweets have a negative sentiment score, a large share of them attract high view counts. This suggests that the accounts are quite successful at dividing opinion, i.e. influencing the masses.
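The comparison behind this observation can be reproduced roughly as below: bucket tweets by their VADER compound score and compare view counts across buckets. The ±0.05 cut-offs follow the common VADER convention; column names are as assumed in the earlier sketches.

```python
# Compare view counts across sentiment buckets (sketch).
import pandas as pd

tweets_df["sentiment_bucket"] = pd.cut(
    tweets_df["sentiment"],
    bins=[-1, -0.05, 0.05, 1],
    labels=["negative", "neutral", "positive"],
)

summary = tweets_df.groupby("sentiment_bucket", observed=True)["views"].agg(["count", "median", "mean"])
print(summary)  # fewer negative tweets, but their view counts can still run high
```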
Lastly, we presented the results of our analysis in a manner that was both clear and concise, ensuring comprehensibility for a diverse audience. This included making our work accessible even to those who don't have domain expertise in statistics.