The repository for the term project has been created. Several files have been added to the repository, including LICENSE.md, project_plan.md, progress_report.md, README.md, and .gitignore. I have decided which subreddits I would like to collect posts from: several subreddits dedicated to various cities in the United States. I have also registered my project with Reddit so that I can use the Reddit API.
The 1st Progress Report revolved around the data that I will use for this project. The first thing that I did was read about and experiment with PRAW, the Python Reddit API Wrapper that I used to collect posts and various information about them. The notebooks/ folder and the dataCollection notebook were created to collect data for this project. The next thing I did was select a variety of subreddits that I believed would be good to analyze. I have moved away from the idea of using subreddits related to cities, as I agreed with the feedback I received that regional differences wouldn't be very noticeable in writing. Instead, I have pivoted to subreddits whose communities would potentially exemplify different ways people write, such as gamers, lawyers, and students. Once I selected a variety of subreddits, I used the API to read posts from each subreddit into DataFrames. I decided to include information such as the title, the author, the text, the number of upvotes, the number of comments, and the upvote ratio. I cleaned these DataFrames up by removing entries that had blank text, and I wrote them out to .csv files for data display purposes. It took about fifteen subreddits to reach my goal of at least 10,000 posts. I am not sure why the API didn't allow me to read more than about 900 posts from a subreddit at once, but that is something to look into. All of the data from the subreddits is saved as .csv files in the data folder. In any case, the notebook took an extremely long time to run, as the calls to read in the posts were very time consuming, which may be due to the way I read them in. I should see if there is anything I can do to speed that up in case I need to read in posts again. A sample of this data can be found here. The data notebook can be found here.
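Below is a minimal sketch of what this collection step might look like with PRAW and pandas. The credentials, subreddit name, and output path are placeholders rather than the project's actual values, and the column choices simply mirror the fields listed above.

```python
import praw
import pandas as pd

# Placeholder credentials from registering an app with the Reddit API.
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="term-project data collection script",
)

def collect_posts(subreddit_name, limit=1000):
    """Pull recent posts and keep the fields described above."""
    rows = []
    for post in reddit.subreddit(subreddit_name).new(limit=limit):
        rows.append({
            "title": post.title,
            "author": str(post.author),
            "text": post.selftext,
            "upvotes": post.score,
            "num_comments": post.num_comments,
            "upvote_ratio": post.upvote_ratio,
        })
    df = pd.DataFrame(rows)
    return df[df["text"].str.strip() != ""]  # drop entries with blank text

# Hypothetical subreddit name and output path.
collect_posts("example_subreddit").to_csv("data/example_subreddit.csv", index=False)
```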
Considering all of the data that I am using is publicly available posts on Reddit, I plan to make all of my data publicly available. Reddit only restricts uses that are commercial, which mine is not. I will review the licensing information on Reddit before I make a final decision. If it turns out I am wrong and I cannot make my data publicly available, I will, if possible, post smaller samples to exemplify the type of data that I worked with during this project. If I cannot do this, then I will create fake posts to exemplify how I worked with the data in this project, such as annotations.
The 2nd Progress Report revolved around data organization and analysis. The first thing that I did was create a second notebook, dataOrganization.ipynb. In this notebook, my goal was to combine the several .csv files that were generated from the first notebook, dataCollection.ipynb. The reason for this is that the API is limited to 1,000 posts per call, and I am working with far more posts than that. Therefore, I had to query subreddits multiple times, sometimes waiting a few hours between queries, in order to get more posts. Once the .csv files were combined into one DataFrame per subreddit, I cleaned up the columns and made all of the DataFrames the same size. I found the largest post count that every subreddit could reach, which is 1,500; therefore, each subreddit is now a collection of 1,500 posts. At the end of this notebook, I packaged them up into .csv files for further use. The second thing that I did was create a third notebook, dataAnalysis1.ipynb. In this notebook, I used language-tool-python as a baseline assessment of grammaticality. This included finding the top errors across subreddits, analyzing those errors, and comparing those errors between subreddits. This Progress Report taught me a lot about how big a part of a project data collection, organization, and analysis are. If I wasn't spending time figuring out how to do a specific task, I was repeating the same task for each of the fifteen subreddits and letting it run. Therefore, a lot of additional setup and configuration was required before I could get into the analysis portion. In addition, I organized and formatted my notebooks in a way that makes them more readable.
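As an illustration of the combining step, here is a rough pandas sketch under the assumption that each query wrote its own per-subreddit .csv file; the file names and the de-duplication columns are hypothetical.

```python
import glob
import pandas as pd

# Hypothetical file pattern for the multiple pulls of one subreddit.
frames = [pd.read_csv(path) for path in glob.glob("data/example_subreddit_*.csv")]
combined = pd.concat(frames, ignore_index=True)

# Repeated queries can return overlapping posts, so drop duplicates,
# then trim every subreddit to the same size (1,500 posts).
combined = combined.drop_duplicates(subset=["title", "author", "text"])
combined = combined.head(1500)
combined.to_csv("data_samples/example_subreddit_final.csv", index=False)
```

And a minimal sketch of the baseline grammaticality check, assuming the combined .csv above and that language-tool-python match objects expose a ruleId attribute (as in recent releases):

```python
from collections import Counter

import language_tool_python
import pandas as pd

tool = language_tool_python.LanguageTool("en-US")
posts = pd.read_csv("data_samples/example_subreddit_final.csv")  # hypothetical path

# Tally LanguageTool rule hits across every post in the subreddit.
error_counts = Counter()
for text in posts["text"].dropna():
    for match in tool.check(text):
        error_counts[match.ruleId] += 1

print(error_counts.most_common(10))  # top error rules for this subreddit
```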
For the found portion of my data, I will be making all of my data publicly available. Reddit's rules for the API are here. Within this webpage, Reddit outlines that posts can be displayed and formatted. Therefore, I am free to post samples of my data, which are .csv files with information relating to Reddit posts. Data samples can be found here. Within this folder, there are several .csv files. Most of these are the final subreddit .csv files, along with a few of the original subreddit .csv files from before cleanup.
For my project, I have chosen the GNU General Public License v3.0. I am happy to have other people use and modify the code I have written in this repository, as long as I am given credit for my original work.
The 3rd Progress Report revolved around data analysis. The first thing that I did was create a fourth notebook, dataAnalysis2.ipynb. To note, this is a new notebook that continues the earlier analysis. In this notebook, my goal was to go even further with my analysis of errors in subreddit posts. This involved taking a closer look at some of the errors across all of the subreddits. Specifically, I found that spelling errors are the most common error across all subreddits. This isn't surprising considering it is the Internet and typos happen all of the time. However, I found something interesting with the tool that I used: it marks people's names, names of applications, abbreviations, and regional words as errors. This produces a lot of flagged sentences that aren't actually ungrammatical. Therefore, I made the decision to ignore all spelling errors, as this category isn't very helpful in terms of grammaticality. In addition, I once again looked at the top errors of each subreddit, with certain errors filtered out because they were not helpful in my analysis. I discovered that COMMA_COMPOUND_SENTENCE is the most common error that is related to grammaticality. I also explored other top errors and took a closer look at sentences containing them. I found that some errors, for example those related to formatting, aren't helpful for my analysis. I realized that I should have continued using my dataAnalysis1 notebook, as I reuse some code from it, but it was too late into my work to switch over, since some things took a while to run. Working with data of this size is very challenging, especially when it comes to analysis, as there is a lot of filtering that needs to happen but isn't feasible with a dataset of this size. I have not made any changes to my data since the 2nd Progress Report. All of the data is still in its final form in the data_samples folder. In addition, a link to my guestbook has been added to my README.
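A rough sketch of what this filtered comparison might look like, assuming the final per-subreddit .csv files in the data_samples folder, that LanguageTool's spelling rules share the MORFOLOGIK_RULE prefix, and that match objects expose a ruleId attribute; the subreddit names and paths below are placeholders.

```python
from collections import Counter

import language_tool_python
import pandas as pd

tool = language_tool_python.LanguageTool("en-US")

def rule_counts(csv_path, ignored_prefixes=("MORFOLOGIK_RULE",)):
    """Count LanguageTool rule hits in a subreddit's posts, skipping spelling rules."""
    counts = Counter()
    for text in pd.read_csv(csv_path)["text"].dropna():
        for match in tool.check(text):
            if not match.ruleId.startswith(ignored_prefixes):
                counts[match.ruleId] += 1
    return counts

# Hypothetical subreddit file names.
for name in ["example_subreddit_a", "example_subreddit_b"]:
    counts = rule_counts(f"data_samples/{name}_final.csv")
    print(name, counts["COMMA_COMPOUND_SENTENCE"], counts.most_common(5))
```

Other unhelpful rules, such as formatting-related ones, could be added to ignored_prefixes as they turn up in the counts.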