- I've created the repository and filled out the necessary files
- Begun looking into using the Reddit API to create my corpus
I've begun using PRAW and PMAW to scrape data from r/AmITheAsshole. PRAW, the Python wrapper for the official Reddit API, is not enough on its own for my needs: the Reddit API does not let you query posts older than a certain cutoff (I'm still unsure what that cutoff is), and it only lets you scrape 1,000 posts at a time. Because I'm only using one subreddit for this project, that does not give me a satisfactory number of posts. Luckily, there's a third-party API called the Pushshift API, with the Python wrapper PMAW, which lets you scrape all archived posts for a given subreddit. However, I came across a new problem: apparently, the Pushshift API is undergoing a migration and does not have access to any data from before November 2022 (source). AITA is a popular subreddit, so there is still a large amount of data available, but going forward it will be important to keep in mind how narrow a slice of time this data covers when drawing conclusions about trends in how people write. Lastly, it seems that limiting the number of posts you scrape is currently broken: if you set your limit above 1,000 posts, PMAW will just scrape everything available. As such, I have yet to complete a successful run that scrapes all of my data at once, and I am doing it in batches in a notebook I currently have set to be ignored. A notebook with a sample of the process is available at "code/data_collection_testing.ipynb".
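One way to do the batching described above is to split the date range into fixed-size windows and scrape one window at a time. The sketch below only builds the windows; it does not call any API. The function name `batch_windows`, the seven-day window size, and the example dates are assumptions made for illustration, not what the notebook does.

```python
from datetime import datetime, timedelta, timezone


def batch_windows(start, end, days=7):
    """Yield (after, before) epoch-second pairs that cover [start, end)."""
    step = timedelta(days=days)
    cursor = start
    while cursor < end:
        # Clamp the last window so it never runs past the end date.
        upper = min(cursor + step, end)
        yield int(cursor.timestamp()), int(upper.timestamp())
        cursor = upper


# Hypothetical date range, chosen only for illustration.
start = datetime(2022, 11, 1, tzinfo=timezone.utc)
end = datetime(2022, 12, 1, tzinfo=timezone.utc)
windows = list(batch_windows(start, end, days=7))

for after, before in windows:
    # Each pair would be passed to the scraper as its time bounds.
    # (Assumption: the PMAW version in use accepts epoch-second bounds.)
    print(after, before)
```

Keeping each window small should also keep any single request below the point where the limit parameter stops being respected, though I have not verified that here.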
As for sharing plans, all of the data I use for this project should hypothetically already be available for people to view, so I think it would be safe to have my corpus publicly accessible. However, I might consider omitting the usernames of the posters. Although many people use a throwaway/temporary account to post to this subreddit, I am unsure whether publicizing the usernames in this corpus could result in some harm.
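Omitting usernames could be as simple as dropping one field before the corpus is saved. This is a minimal sketch on plain dictionaries; the field name `author` is the one Reddit uses for the poster, but the helper name `strip_usernames` and the sample record are made up for illustration.

```python
def strip_usernames(posts, field="author"):
    """Return copies of the post records without the username field."""
    return [
        {key: value for key, value in post.items() if key != field}
        for post in posts
    ]


# Made-up record, for illustration only.
sample = [
    {"author": "throwaway_example", "title": "AITA for ...", "selftext": "..."},
]
cleaned = strip_usernames(sample)
print(cleaned)
```

If the corpus lives in a pandas DataFrame, `df.drop(columns=["author"])` should do the same thing.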
I spent most of the time since the previous update figuring out ways to improve my use of PMAW to get the results I wanted, establishing some more specific questions I could explore in my data, and learning about potential methods for answering those new questions.
I ultimately discovered that there's still no solution to the post limit problem with PMAW and that there's no way to filter the search through the API based on link flair. So, the best solution I could determine was to cast the widest net I feasibly could given the tools available and then pare down to what I wanted. I also discovered in this process that the vast majority of posts made to this subreddit are eventually deleted, so I started with a corpus of 100,000 posts and ended with one of just barely 9,500. I think, given the limits of the current tools, that might be the best that I'm able to get. The final data collection notebook is at data_collection.ipynb and the final data set can be found at data_collection.ipynb.
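The "cast a wide net, then pare down" step can be sketched as a filter over the scraped records. The field names `selftext` and `link_flair_text` are the ones Reddit submissions carry. The marker strings, the flair values, and the sample posts are assumptions for illustration and may not match what the notebook checks.

```python
# Bodies that indicate the post text is no longer available (assumed markers).
REMOVED_MARKERS = {"[removed]", "[deleted]", ""}


def keep_post(post, wanted_flairs):
    """Keep a post only if its text survives and its flair is one we want."""
    body = (post.get("selftext") or "").strip()
    if body in REMOVED_MARKERS:
        return False
    return post.get("link_flair_text") in wanted_flairs


def pare_down(posts, wanted_flairs):
    """Filter locally, since the API search cannot filter on flair."""
    return [post for post in posts if keep_post(post, wanted_flairs)]


# Made-up records and flair names, for illustration only.
scraped = [
    {"selftext": "So last week my roommate...", "link_flair_text": "Not the A-hole"},
    {"selftext": "[removed]", "link_flair_text": "Asshole"},
    {"selftext": "My sister asked me to...", "link_flair_text": None},
    {"selftext": "I told my coworker...", "link_flair_text": "Asshole"},
]
kept = pare_down(scraped, wanted_flairs={"Not the A-hole", "Asshole"})
print(len(scraped), "->", len(kept))
```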
As for questions, my two main goals were to think of features that could indicate people obscuring or justifying their story, and then to think of ways to explore them with my current skill set. To start out, I thought it would be easy to adapt a few things from Homework 2 to see if post length could be a meaningful indicator. Then, I was curious about who or what is being focalized in these stories, and I thought that identifying the sentence subjects would be a good indicator of that. I found the spaCy library in this process. A lot of my time was spent trying to get the library to work at all within Jupyter Notebook and understanding how it works. Much of the rest was spent waiting for spaCy to finish processing before I could check whether it did what I wanted. I'm not sure whether doing these operations directly within the DataFrame adds meaningful overhead or if spaCy just takes a long time in general, but it takes around half an hour to process the whole DataFrame on my machine. The analysis is limited for now, but I've set up methods that let me easily explore questions, like counting the use of the passive and the use of each pronoun, using spaCy to analyze dependencies. Data analysis can be found at data_analysis.ipynb.
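The subject counting can be sketched independently of the parser by working on (token text, dependency label) pairs. With spaCy, such pairs would come from something like `[(tok.text, tok.dep_) for tok in nlp(text)]`. The labels `nsubj` and `nsubjpass` are the subject labels used by spaCy's English models. The function name `count_subjects` and the hand-written parse below are assumptions for illustration, not output from the notebook.

```python
from collections import Counter

# Dependency labels that mark an active or passive sentence subject.
SUBJECT_DEPS = {"nsubj", "nsubjpass"}


def count_subjects(parsed_tokens):
    """Count how often each word appears as a sentence subject."""
    counts = Counter()
    for text, dep in parsed_tokens:
        if dep in SUBJECT_DEPS:
            counts[text.lower()] += 1
    return counts


# Hand-written parse of "I was told that she left." (not real parser output).
parsed = [
    ("I", "nsubjpass"),
    ("was", "auxpass"),
    ("told", "ROOT"),
    ("that", "mark"),
    ("she", "nsubj"),
    ("left", "ccomp"),
    (".", "punct"),
]
subject_counts = count_subjects(parsed)
print(subject_counts)
```

On speed: spaCy's `nlp.pipe(texts)` processes texts in batches and is generally faster than calling `nlp` once per row, so it may be worth trying before concluding that the DataFrame is the bottleneck.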
I spent my time continuing the analysis in data_analysis.ipynb, mostly exploring the next question I had set out to consider: how do these writers use passive and active voice?
I started out researching various ways to use spaCy to sort sentences by voice. While researching, I came across several answers that all used the Matcher object. In particular, I attempted to implement this process I found on StackOverflow, but could not get it to work successfully. I believe the main problem is that the sentences in the data set are typically compound sentences, and they use terms and syntax unique to this subreddit. This means that spaCy has a hard time determining dependencies, and that it's hard to define exactly where the boundaries of a clause are. If someone uses a passive clause, it is often embedded in a larger active sentence. So, I spent a while trying to see what's going on and whether there are any simpler ways to go about separating clauses by voice.
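A simpler alternative to the Matcher approach is to check which dependency labels occur in a sentence and allow a "mixed" category for sentences that embed a passive clause in an active one. This is a rough sketch of that idea, not the StackOverflow method and not something the notebook implements. The labels `nsubjpass` and `auxpass` are the passive markers in spaCy's English models; the function name `sentence_voice` and the example label lists are hand-written assumptions.

```python
PASSIVE_DEPS = {"nsubjpass", "auxpass"}


def sentence_voice(deps):
    """Label a sentence by the dependency labels it contains."""
    labels = set(deps)
    has_passive = bool(labels & PASSIVE_DEPS)
    # Note: nsubj also covers copular subjects ("she is tired"),
    # so "active" here really means "has a non-passive subject".
    has_active = "nsubj" in labels
    if has_passive and has_active:
        return "mixed"
    if has_passive:
        return "passive"
    if has_active:
        return "active"
    return "none"


# Hand-written label sequences (not real parser output).
examples = {
    "I was told that she left.": ["nsubjpass", "auxpass", "ROOT", "mark", "nsubj", "ccomp", "punct"],
    "She left the party.": ["nsubj", "ROOT", "det", "dobj", "punct"],
    "The door was closed.": ["det", "nsubjpass", "auxpass", "ROOT", "punct"],
}
voices = {sentence: sentence_voice(deps) for sentence, deps in examples.items()}
print(voices)
```

This sidesteps clause boundaries entirely, so it cannot say which clause is passive, and it is only as accurate as the parser's labels.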
For the sake of analysis, I ended up setting that project aside and decided to keep focusing on the information I could get out of the sentence subjects I parsed last time. In the process I did find some tokens inaccurately tagged as a passive or active subject, but for the most part I find the tagging to be reliable. I adjusted my previous functions to break down the number of passive versus active subjects, and then I used that information to dive into some example posts to see how people are using voice. Some of my findings were fairly straightforward (for example, inanimate objects are represented more passively than people are), but there are some examples that showcase the different reasons why people emphasize or de-emphasize volition. I think pivoting to analyze some textual examples is effective for getting my point across about passive voice and agency, but if possible I'd still like to gather some more quantitative evidence.
I also updated my repo to include my license. I decided on the GNU General Public License v3.0, as I'd be fine and even excited if others were to modify or build upon my code, and I think it's safe to share access to my data set now that it doesn't have usernames in it.
Lastly, as I worked through the analysis for this section, I did notice the limitations that my data set has because of its size. Before next week's presentations I might make one more attempt at increasing the number of posts in my data set, but ultimately my focus from here on will be on seeing if I can get passive/active sentence sorting to work and on implementing machine learning tools to try to find any patterns in the topics that are written about in each ruling, any common phrases, and any further insights into how people curate the tone of their stories.