Agriculture | Arts & Culture | Civil Rights Movement | Community Organizing | Crime & Punishment | Discrimination & Affirmative Action | Education | Electoral Politics | Labor | Law & Government | News & Media | Poverty & Unemployment | Public Health & Services | Race | Southern Vignettes | Spiritual Life | Voting Rights | War & Violence
The categories above were generated using topic modeling, a form of text mining that identifies clusters of related words across a body of texts. A probabilistic algorithm groups words that are likely to appear in the same context. Although these groupings can be thought of as ‘topics’, refining the data and assigning names to topics is a subjective process that is open to interpretation. The categories above are some general themes in Southern Changes discovered with the topic modeling toolkit MALLET.
This topic model was generated to approximate the subject cataloging used in libraries. A key decision when using topic modeling tools is selecting the number of topics to identify. A large number yields fine-grained topics that may be relevant to only a single document, while a small number produces broad topics that are difficult to interpret. A good result is usually one that “feels right” for a particular research objective. In this case, running the algorithm to identify twenty topics produced a useful set of subjects; only two were excluded for being too broad to interpret meaningfully.
Topic models produce two kinds of output: 1) the top words for each topic and 2) the prevalence of each topic in each document. Top words are the most frequent and statistically significant words in a topic. In developing this model, a few adjustments were made to produce more meaningful top words. MALLET’s default list of stop words (common words like ‘and’ and ‘the’ that are excluded from analysis) was expanded. In this model, words like ‘issue’ and ‘staff’ were added to the list because they usually appear in the byline and have little to do with the topical content.
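The effect of expanding a stop list can be illustrated with a short Python sketch. It is not the code used for this model: the small default list is a hypothetical stand-in for MALLET’s much longer one, and the function names are invented for illustration. Only the added words ‘issue’ and ‘staff’ come from the description above.

```python
import re

# Hypothetical stand-in for a default stop list; MALLET's real list is far longer.
DEFAULT_STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}

# Corpus-specific additions; 'issue' and 'staff' are the examples named in the text.
EXTRA_STOP_WORDS = {"issue", "staff"}

STOP_WORDS = DEFAULT_STOP_WORDS | EXTRA_STOP_WORDS


def tokenize(text):
    """Lowercase the text and split it into tokens of letters and underscores."""
    return re.findall(r"[a-z_]+", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not in the stop list, preserving their order."""
    return [t for t in tokens if t not in stop_words]


sample = "Staff report in the spring issue: voting rights and the courts"
print(remove_stop_words(tokenize(sample)))
```

With the expanded list, byline words such as ‘staff’ and ‘issue’ never reach the model, so they cannot crowd out more meaningful top words.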
Certain top words hinted at the presence of phrases such as “affirmative action” and “civil rights.” A Python script using the NLTK library was run to identify the most common two-word and three-word phrases. Then the plain text files were modified by replacing the spaces in these phrases with underscores so that each phrase could be incorporated into the model as a single term. Two articles in Spanish were excluded from the analysis because the English word topics did not apply to their content.
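The phrase step can be sketched roughly as follows. The original script used NLTK; this version is a simplified stand-in that relies only on Python’s standard library and ranks phrases by raw frequency. The function names, the sample sentences, and the small stop list are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def ngrams(tokens, n):
    """Yield every run of n consecutive tokens as a tuple."""
    return zip(*(tokens[i:] for i in range(n)))


def top_phrases(texts, n, k, stop_words=frozenset()):
    """Return the k most frequent n-word phrases that contain no stop word."""
    counts = Counter()
    for text in texts:
        for gram in ngrams(tokenize(text), n):
            if not any(word in stop_words for word in gram):
                counts[" ".join(gram)] += 1
    return [phrase for phrase, _ in counts.most_common(k)]


def join_phrases(text, phrases):
    """Replace the spaces inside each phrase with underscores, keeping the original case."""
    # Longest phrases first, so a three-word phrase is joined before its two-word parts.
    for phrase in sorted(phrases, key=len, reverse=True):
        pattern = r"\b" + r"\s+".join(re.escape(w) for w in phrase.split()) + r"\b"
        text = re.sub(
            pattern,
            lambda m: re.sub(r"\s+", "_", m.group(0)),
            text,
            flags=re.IGNORECASE,
        )
    return text


# Hypothetical sample sentences, not taken from the corpus.
texts = [
    "The civil rights movement grew.",
    "Civil rights and affirmative action.",
    "Affirmative action and civil rights law.",
]
phrases = top_phrases(texts, n=2, k=2, stop_words={"the", "and"})
print(phrases)
print(join_phrases(texts[1], phrases))
```

Once joined, a phrase such as ‘civil_rights’ is treated by the tokenizer as one word, so it can surface among a topic’s top words as a unit.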
Finally, the topics were curated to serve as subjects. Stephanie Rodgers, a graduate student working in the Emory Center for Digital Scholarship, provided labels for the topics after reading many of the articles and identifying outliers to exclude from subject browsing.
The download below includes all plain text files, the expanded list of stop words, and the R script used to generate the model and wordclouds.
Sara Palmer
Emory Center for Digital Scholarship