Headline Snaps are part of an experiment a friend and I started some time during our college years.
Essentially, they are fabricated news headlines inspired mostly by real-world events, but exaggerated for comedic effect. Specifically, the source data we're working with is .png or .jpeg files, created via Snapchat, with a plain black background and white text as the news headline.
Once we realized the comedy and entertainment value they could provide, we started producing Headline Snaps more frequently, to the point where we've now amassed several thousand of them over the course of 6-7 years. The sheer volume made us want to convert them into some form of analyzable data, so we could run various experiments (graphs, trends, themes, language models, text synthesis). That is what motivated this project.
I preferred to keep this project as a set of tools (hence 'toolkit'), rather than a "pre-made database plus specific set of experiments" built from our current set of source data. This way, it can be applied to future sets of Headline Snaps, should anyone decide to emulate this experiment with their own friends. I think it's interesting to see the interplay between two (or more) participants' unique experiences and senses of humor once the source material has been converted to usable data by the pipeline created in this project. From there, the data can be used to generate new Headline Snaps in a way that captures a fusion of the sentiments of everyone who contributed to the dataset.
Of course, this particular form of the data (text rendered into an image file) is not optimal, either for ease of creation or for data processing. But for us it had become a habit, and our existing thousands of Headline Snaps were already in this form. So I decided that the data processing pipeline would need an OCR step, which is reasonable here because white text on a black background is close to ideal input for OCR. It also became part of the learning aspect of this project.
- OCR/Tesseract via pytesseract
- image processing via PIL
- command line argument handling via argparse
- language models / text synthesis via nltk
- database handling with sqlite
- data analysis with word clouds, bar graphs, etc.
- more Python development
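
As a rough illustration of how a few of these pieces could fit together, here is a minimal sketch (not the toolkit's actual code) that OCRs one image, stores the text in SQLite, and counts word frequencies as raw material for a bar graph or word cloud. The function names, the `headlines` table, and its columns are all hypothetical. The OCR function assumes Pillow, pytesseract, and the Tesseract binary are installed; inverting the image to black-on-white before OCR is an assumption that often helps, not a requirement.

```python
import sqlite3
from collections import Counter


def clean_text(raw):
    """Collapse the line breaks and repeated spaces OCR output tends to contain."""
    return " ".join(raw.split())


def ocr_headline(image_path):
    """OCR a single Headline Snap (hypothetical helper)."""
    # Imported lazily so the database helpers below work without the OCR stack.
    from PIL import Image, ImageOps
    import pytesseract

    image = Image.open(image_path).convert("L")  # grayscale
    image = ImageOps.invert(image)  # white-on-black becomes black-on-white
    return clean_text(pytesseract.image_to_string(image))


def open_db(path=":memory:"):
    """Open (or create) a database with a hypothetical 'headlines' table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS headlines ("
        "id INTEGER PRIMARY KEY, source TEXT UNIQUE, text TEXT NOT NULL)"
    )
    return conn


def store_headline(conn, source, text):
    """Insert a headline, replacing any earlier row for the same source file."""
    conn.execute(
        "INSERT OR REPLACE INTO headlines (source, text) VALUES (?, ?)",
        (source, clean_text(text)),
    )
    conn.commit()


def top_words(conn, n=10):
    """Return the n most common lowercase words across all stored headlines."""
    counts = Counter()
    for (text,) in conn.execute("SELECT text FROM headlines"):
        counts.update(text.lower().split())
    return counts.most_common(n)
```

With the OCR dependencies in place, something like `store_headline(conn, "snap_0001.png", ocr_headline("snap_0001.png"))` would add one image to the database (the file name is a placeholder), after which `top_words(conn)` gives a quick look at recurring vocabulary.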