Marshall A. Taylor and Dustin S. Stoltz
This repository contains all R code and data necessary to reproduce the analyses in our paper "Integrating Semantic Directions with Concept Mover's Distance to Measure Binary Concept Engagement," forthcoming in Journal of Computational Social Science. It is a short-note follow-up to our earlier paper in the same journal (Stoltz and Taylor 2019), "Concept Mover's Distance: Measuring Concept Engagement via Word Embeddings in Texts."
In the original JCSS paper, we put forth a method for measuring concept engagement in texts, Concept Mover's Distance (CMD). It uses word embeddings to find the minimum cost necessary for words in an observed document to "travel" to words in a pseudo-document, that is, a document consisting only of words denoting a concept of interest. One potential limitation of the method is that words associated with opposing concepts tend to be located close to one another in the underlying embedding space, so a document that is close to one concept will likely be similarly close to a starkly opposing concept (e.g., "life" and "death"). In this short note, we propose a method for dealing with this "binary concept problem" in CMD by incorporating recent work on word embeddings in cultural sociology. Using aggregate vector differences between antonym pairs to extract a direction in the semantic space pointing toward one pole of the binary opposition ("The Geometry of Culture," American Sociological Review, 2019), we illustrate how CMD can be used to measure a document's engagement with binary concepts.
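The idea of a semantic direction can be illustrated with a minimal base R sketch. It is not the code used in the paper: the three-dimensional embedding values are made up, the helper names `semantic_direction` and `cosine` are hypothetical, and "aggregate" is assumed here to mean a simple average of the pairwise differences. The sketch covers only the direction itself, not the CMD computation.

```r
# Toy embeddings with made-up values (illustration only, not real vectors).
emb <- rbind(
  life  = c( 0.9, 0.1, 0.3),
  death = c(-0.8, 0.2, 0.3),
  alive = c( 0.7, 0.0, 0.4),
  dead  = c(-0.9, 0.1, 0.4),
  birth = c( 0.6, 0.3, 0.2),
  grave = c(-0.7, 0.2, 0.1)
)

# Antonym pairs; the direction will point toward the "pos" pole.
pairs <- cbind(pos = c("life", "alive"), neg = c("death", "dead"))

# Hypothetical helper: average the vector differences across antonym pairs.
semantic_direction <- function(emb, pairs) {
  diffs <- emb[pairs[, "pos"], , drop = FALSE] -
    emb[pairs[, "neg"], , drop = FALSE]
  colMeans(diffs)
}

# Hypothetical helper: cosine similarity between two numeric vectors.
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

direction <- semantic_direction(emb, pairs)

# Words near the "life" pole score positive, words near "death" negative.
cosine(emb["birth", ], direction)
cosine(emb["grave", ], direction)
```

In this toy setup, the sign of the cosine separates the two poles even though "birth" and "grave" might otherwise sit close together in the space, which is the property that addresses the binary concept problem described above.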
To reproduce the figures and regression models in the paper, download all scripts and CSVs to a local folder, and load the packages in the 1_cmdgeo_prep_functions.R script. The remaining scripts are self-contained, and each corresponds to a section of the note. Some of the figures require downloading text from Project Gutenberg, which may take some time. Note also that our CMDist function has been updated to include semantic directions; as such, you will need to update the package to the most recent version (0.4.1 as of March 25, 2020) in order to replicate the analyses.
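One way to confirm the version requirement before running the scripts is sketched below. It assumes the package is installed under the name "CMDist" (the text above names only the function), and `is_outdated` is a hypothetical helper; `requireNamespace`, `packageVersion`, and `package_version` are standard R functions.

```r
# Hypothetical helper: TRUE if `installed` is older than `required`.
is_outdated <- function(installed, required = "0.4.1") {
  package_version(installed) < package_version(required)
}

# Assumption: the package is installed under the name "CMDist".
if (!requireNamespace("CMDist", quietly = TRUE)) {
  message("CMDist is not installed; install it before running the scripts.")
} else if (is_outdated(as.character(packageVersion("CMDist")))) {
  message("Installed CMDist is older than 0.4.1; please update it.")
} else {
  message("Installed CMDist meets the 0.4.1 requirement.")
}
```

Comparing through `package_version` rather than plain strings keeps the ordering correct for versions such as 0.10.0, which sorts before 0.4.1 as text.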