The goal of this project is to perform data science and sentiment analysis on various subreddits to determine patterns in how to reach the top post. Below is a sample report.
Click here to see the notebook on training the classifier
In this notebook, we will perform sentiment analysis, and some general analysis on the top 1k submissions in r/Toronto using the naive-bayes classification model we built previously, and some other tools. We will be using Seaborn to create plots.
The dependancies are listed out in requirements.txt. They can be quickly installed with Pip by running the following command
python -m pip -r requirements.txt
import pandas as pd
import pickle
import numpy
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
from nltk.tokenize import sent_tokenize, word_tokenize
from train import cleanup_sentence, sentence_to_dict
sns.set_theme()
TITLE = 'r/Toronto'
_classifier = open('bin/classifier.o', 'rb')
classifier = pickle.load(_classifier)
df = pd.read_csv('data/toronto_dump.csv')
df['author'].convert_dtypes()
df['author'].fillna(value='Unknown', inplace=True)
df['created_utc'] = pd.to_datetime(df['created_utc'], unit='s')
flairs = df.loc[:]['link_flair_text']
flairs = pd.DataFrame.from_dict(Counter(flairs), orient='index')
flairs = flairs.sort_values(by=0, ascending=False)
ax = sns.countplot(y='link_flair_text', data=df, order=df['link_flair_text'].value_counts().index)
ax = ax.set(title='Flairs in r/Toronto')
Pitures, Videos, and news are amongst the top flairs in r/Toronto. There are some flairs which are unavaiable for general usage such as 'Stats Misleading', or 'Fine B O I'.
title_is_positive = []
for i, sentence in enumerate(df.loc[:]['title']):
s = word_tokenize(str(sentence))
s = cleanup_sentence(s)
s = sentence_to_dict([s])
inp = []
for key in s:
inp.append(key)
pred = classifier.classify(key)
title_is_positive.append('Positive' if pred else 'Negative')
df['title_is_positive'] = title_is_positive
ax = sns.countplot(y='title_is_positive', data=df)
ax = ax.set(xlabel='Count', ylabel='Emotion', title='Naive Bayes Classification of Submission Titles')
pos_count = df.loc[df['title_is_positive'] == 'Positive', 'title_is_positive'].count()
neg_count = len(df) - pos_count
print(pos_count, neg_count)
614 385
614/1000 submissions in r/Toronto all were classified as positive submissions.
contributors = {}
max_posts_count = -1
max_posts_author = ""
for contributor in df.loc[:]['author']:
if contributor == 'Unknown':
continue
if contributor in contributors:
contributors[contributor] += 1
if contributors[contributor] > max_posts_count:
max_posts_count = contributors[contributor]
max_posts_author = contributor
else:
contributors[contributor] = 1
print('Number of unique contributors: {}'.format(len(contributors)))
print('Top contributor in {} all posts: {} ({} submissions)'.format(TITLE, max_posts_author, max_posts_count))
Number of unique contributors: 757
Top contributor in r/Toronto all posts: ur_a_idiet (16 submissions)
Out of 1000 top posts in r/Toronto. 75.7% of top posts were made by unique redditors. The person with the most submissions in all is ur_a_idiet with 16 submissions.
Most of the top posts (>70%) were submitted in the afternoon or evening. Few top posts were submitted at night.
hours_of_submissions = df['created_utc'].dt.hour
minutes_of_submissions = df['created_utc'].dt.minute
times = []
for i in range(len(hours_of_submissions)):
hour = hours_of_submissions[i]
minute = minutes_of_submissions[i]
times.append(hour + minute / 60.0)
times = numpy.array(times)
times, counts = numpy.unique(times, return_counts=True)
ax = sns.lineplot(x=times, y=counts)
ax = ax.set(xlabel='Hour', ylabel='Frequency', title='Hour of Top Posts in {}'.format(TITLE))
Some more details that can be looked into are submission Redditor's account age, what kinds of posts are made during certain periods of the day. The classification model can also be improved, which we talked about in that lab.
We shall also do the same analysis in r/Toronto, but for a different time period to compare the results. During the pandemic, probably more of the top posts would be news, and videos. Furthermore, we expect the hour of posts to be more uniform as most of the population are spending more time online. More negative submissions would also be expected.