Read how the research team of the "In Search of Zora" project devised and executed their analysis.
“In Search of Zora/When Metadata Isn’t Enough: Rescuing the Experiences of Black Women Through Statistical Modeling”
Authors / Project Team:
Nicole M. Brown - Saint Mary’s College of California
Ruby Mendenhall - University of Illinois at Urbana-Champaign
Michael Black - University of Massachusetts Lowell
Mark Van Moer - University of Illinois at Urbana-Champaign
Karen Flynn - University of Illinois at Urbana-Champaign
Malaika McKee - University of Illinois at Urbana-Champaign
Assata Zerai - University of New Mexico
Ismini Lourentzou - University of Illinois at Urbana-Champaign
ChengXiang Zhai - University of Illinois at Urbana-Champaign
There are many ways to use HathiTrust and the HathiTrust Research Center (HTRC) for text analysis research. Here we give an example of one way the HathiTrust Digital Library (HTDL) and HTRC were used for a research project.
This project uses topic modeling and hundreds of thousands of volumes in HathiTrust to develop a research method that helps identify documents written by or about Black women that were previously unattributed or inaccurately attributed. This guide will take you through the data, process, and tools used by the researchers.
Recommended tools/skills to conduct similar research:
(Please note that many people contributed to this project, so expertise in all of these skills and tools is not required of a single individual)
- Access to HathiTrust and JSTOR
- Help from a trusted librarian
- MALLET Topic Modeling tool
- Knowledge of how to use MALLET
- Knowledge of boolean search syntax
- Some knowledge of Latent Dirichlet Allocation (LDA) as a topic modeling algorithm
- Knowledge of topic inference
- Knowledge of Kullback-Leibler divergence and cosine similarity with topic modeling
Below is a description of the process developed by the researchers, called SeRRR, which stands for Search and Recognition, Rescue, and Recovery.
As with all research, we first need research questions. The researchers of this project had three primary research questions:
- What themes emerge about African American women using topic modeling?
- How can the themes identified be used to recover previously unmarked documents?
- How might we visualize the recovery process and Black women’s recovered history?
Search and Recognition
To answer these questions, they needed a corpus of documents written by or about Black women on which to perform topic modeling. The authors drew on the expertise of a university librarian to help with the corpus creation process. To build this corpus, the researchers searched the HathiTrust and JSTOR digital libraries with the following boolean query:
(Black OR “african american” OR negro) AND (wom?n OR female* OR girl*)
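The same query logic can be sketched as a simple metadata filter in Python. This is an illustrative stand-in only: the actual HathiTrust and JSTOR search engines interpret the `?` and `*` wildcards natively, so here they are rendered as rough regex equivalents.

```python
import re

# Hypothetical stand-in for the boolean search above: the first group
# mirrors (Black OR "african american" OR negro), the second mirrors
# (wom?n OR female* OR girl*), with ? and * translated into regex.
RACE_TERMS = re.compile(r"\b(black|african american|negro)\b", re.IGNORECASE)
GENDER_TERMS = re.compile(r"\b(wom.n|female\w*|girl\w*)\b", re.IGNORECASE)

def matches_query(text: str) -> bool:
    """Return True if the record satisfies both halves of the AND."""
    return bool(RACE_TERMS.search(text)) and bool(GENDER_TERMS.search(text))
```

For example, a title like "Narratives of Negro women in the South" satisfies both halves of the AND, while a record matching only one group does not.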
In addition, the search was limited to English-language documents published between 1746 and 2014, in a restricted set of formats.
Both JSTOR and HathiTrust have ways for researchers to obtain the full texts as well as the metadata for each volume. Because this is a guide for users of HathiTrust, the focus will be on obtaining documents from HathiTrust. More information on acquiring full-text documents from JSTOR can be found on their website.
For the data from HathiTrust, this project made use of a custom dataset request. Custom datasets are available for public domain materials, though there are restrictions in place based on who digitized the item and where the requesting researcher is located. While not used in this project, another data access option could have been an HTRC Data Capsule. Capsules are available to anyone with an HTRC account, though only researchers from member institutions are allowed to analyze texts from the full HathiTrust collection, including in-copyright volumes. The default capsule storage allotment holds up to 100GB of data, which would not accommodate the 800,000-plus volumes used for this project; instead, the researchers had access to and used High Performance Computing resources at the University of Illinois.
The full-text HathiTrust volumes come as .txt files, and the metadata for each volume comes as a .tsv or JSON file; this project used JSON. After performing the search above, the researchers pulled out 19,398 documents that they believed to be authored by African Americans, based on metadata containing known African American author names and/or phrases such as “African American” or “Negro” in the subject headings. Of these 19,398 documents, 6,120 came from HathiTrust and 13,278 from JSTOR.
The researchers then ran the 19,398 documents through the MALLET topic modeling tool, setting the number of topics to 100. Five researchers reviewed the keyword list that makes up each topic, including the words themselves and the number of times each word was assigned to that particular topic. Each researcher individually determined which topics were most relevant, meaning they were about Black women specifically or were topics with which Black women might be connected (e.g., maternal health, education). Once each individual had identified topics of interest, the researchers gathered, discussed their findings, viewed visualizations, adjusted interpretations, and came to a consensus on the relevant topics.
The researchers then developed a tool that scanned the word-to-topic assignments for each document and returned a list of document titles that had a given percentage of their words assigned to the identified topics of interest. They called this technique “intermediate reading,” as opposed to the distant reading of the initial topic model done in MALLET. Intermediate reading initially served as a means to evaluate the validity of the topic model; however, it also led to a re-evaluation of the topics generated by the initial model. Its output is a table with three columns: the percentage of a document’s words attributed to the topic, the year of the document, and the document’s title, partial abstract, and author(s).
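The core of such a filtering tool can be sketched in a few lines of Python. This is not the researchers' actual code; it assumes the per-document topic proportions (such as MALLET's doc-topics output provides) have already been loaded into a dictionary, and the topic indices and threshold are illustrative.

```python
# Sketch of an "intermediate reading" filter. doc_topics maps a document
# title to its topic-proportion vector; topics_of_interest and threshold
# are illustrative parameters, not values from the project.
def intermediate_reading(doc_topics, topics_of_interest, threshold):
    """Return (coverage, title) pairs for documents whose combined
    proportion on the topics of interest meets the threshold,
    sorted from highest coverage down."""
    hits = []
    for title, proportions in doc_topics.items():
        coverage = sum(proportions[t] for t in topics_of_interest)
        if coverage >= threshold:
            hits.append((coverage, title))
    return sorted(hits, reverse=True)

docs = {"A": [0.6, 0.3, 0.1], "B": [0.1, 0.2, 0.7], "C": [0.4, 0.5, 0.1]}
# Documents with at least 40% of words on topic 0: A (60%) and C (40%).
```

Sorting by coverage puts the documents most saturated with the topics of interest at the top, which is what makes the table useful for deciding which titles merit closer reading.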
An analysis of the intermediate reading led the researchers to do a more traditional close reading of a few of the titles highlighted by the results. These close readings gave further insights into the lives of Black women.
Having performed this process on the subset of documents, the researchers had now identified topics that were specifically by or about Black women. Using these topics, they configured the LDA statistical model to mimic the discursive patterns found within text by or about Black women. They created a workflow with a pre-processing step that combined regular expressions with functions from Python’s Natural Language Toolkit (NLTK) library to filter, segment, and tokenize documents into a form suitable for topic modeling. Once the documents were pre-processed, LDA topic modeling was applied to the larger collection using an extension of MALLET that addresses parallelization issues, so the results are reproducible and in machine-readable formats, and then saves the model to disk so it can be used later for topic inference.
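A pre-processing step of this general shape might look like the following. This is a simplified stand-in: the project combined regular expressions with NLTK functions, whereas this sketch uses regular expressions only, and the stopword list is a tiny illustrative sample rather than anything the team used.

```python
import re

# Regex-only stand-in for the project's pre-processing workflow.
# The stopword list below is a small illustrative sample.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "in", "to", "is"}

def preprocess(text: str) -> list[str]:
    """Lowercase, strip non-letters, tokenize, and drop stopwords
    and very short tokens, yielding a bag of words for LDA."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS and len(t) > 2]
```

For instance, `preprocess("The Education of Black Women in 1920")` keeps only the content-bearing tokens `education`, `black`, and `women`.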
The researchers then used these models to perform topic inference with an extension of MALLET, in order to identify documents in the larger 800,000-volume corpus with discursive patterns similar to those specific to the narratives of Black women. Inference works like the initial topic modeling except that the word distribution probabilities are fixed rather than changeable. This matters because running LDA on the same corpus with the same parameters ordinarily yields a different set of keywords (topics) on every run (unless you set the random seed); since inference reuses the fixed model from a previously run LDA, this variability is no longer an issue. Inference produces, for each document, the percentage of the document covered by each topic, and this is called a topic vector. Using the topic vectors, the researchers compared each document in the collection to the previous subset of 19,398 documents.
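The idea of inference against a fixed model can be illustrated with a deliberately simplified sketch. MALLET's real inferencer uses Gibbs sampling over the trained model; here, purely for brevity, each token is hard-assigned to its single most probable topic and the normalized counts form the topic vector. The word-probability table is invented for the example.

```python
from collections import Counter

# Greatly simplified stand-in for topic inference: the trained model's
# word-to-topic probabilities are held fixed, each token is assigned to
# its most probable topic, and the normalized counts form the document's
# topic vector. (MALLET's inferencer samples rather than hard-assigns.)
def infer_topic_vector(tokens, word_topic_probs, num_topics):
    counts = Counter()
    for token in tokens:
        probs = word_topic_probs.get(token)
        if probs is None:
            continue  # word unseen during training is skipped
        counts[max(range(num_topics), key=lambda t: probs[t])] += 1
    total = sum(counts.values()) or 1
    return [counts[t] / total for t in range(num_topics)]

# Invented two-topic model for illustration only.
model = {"school": [0.9, 0.1], "health": [0.2, 0.8]}
```

Because the model is fixed, running this inference twice on the same document always yields the same topic vector, which is exactly the reproducibility property the researchers needed for document-to-document comparison.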
Rescue and Recovery
The researchers now turned to two metrics to measure how strongly each document matched the discursive patterns of the modeled subset: Kullback-Leibler (KL) divergence and cosine similarity. With a topic vector for each document in hand, they computed both metrics for every document, which allowed them to rank the larger corpus by its similarity and dissimilarity to the subset. Comparing the top 20% of the cosine similarity ranking with the top 15% of the KL divergence ranking, the researchers found 150 previously unidentified documents related to Black women.
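Both metrics have standard textbook definitions and are straightforward to compute over topic vectors. The sketch below assumes the vectors are equal-length probability distributions; the small epsilon is a common guard (an assumption here, not the researchers' stated choice) because KL divergence is undefined where the second distribution has zeros.

```python
import math

# The two comparison metrics applied to topic vectors p and q.
def kl_divergence(p, q, eps=1e-12):
    """KL(p || q): 0 when the distributions match, larger as q
    diverges from p. eps guards against log-of-zero."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def cosine_similarity(p, q):
    """Cosine of the angle between p and q: 1.0 for parallel vectors."""
    dot = sum(pi * qi for pi, qi in zip(p, q))
    norm = math.sqrt(sum(pi * pi for pi in p)) * math.sqrt(sum(qi * qi for qi in q))
    return dot / norm if norm else 0.0
```

A document whose topic vector is identical to a subset document's has KL divergence near 0 and cosine similarity of 1.0; ranking by each metric and intersecting the top slices is what surfaced the 150 recovered documents.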
The researchers used the HathiTrust and JSTOR digital libraries, along with an innovative topic modeling technique, to sift through 800,000 documents and find 150 documents by or about Black women that had previously been unattributed or mislabeled. In addition, the use of topic modeling highlighted various issues in the narratives of Black women. Some of these include the erasure of Black women through the documentation of Black children in asylums, and descriptions of Black women found in the footnotes of medical journals, which highlighted the practice of performing non-consensual medical procedures on Black women.
Similar methods can be used in conjunction with volumes from the HTDL and analytical tools from the HTRC by future researchers to find other voices lost and marginalized in the pages of history.
Fascinating. But I am from social science, not the humanities, so it is interesting how differently I think those disciplines would have proceeded with this kind of project (& it's also telling that when identifying the investigators, the description includes their institution but not their department).
Topic modeling has had a substantial impact in the social sciences as well, but I suspect that is transitory. It reminds me of the fascination with factor analysis when numeric data were first available for social science analysis. But almost no social scientist does factor analysis as a statistical analysis today except as a measurement method for already defined concepts.
So, I am guessing that a social scientist with these interests would have started with a specific hypothesis about how Black writing differed from White writing (this is a little far from my own work, so I am not sure of an example of what that hypothesis would be). The problem then would be how to operationalize that difference in terms of text differences, not searching for a computer algorithm that would suggest what those differences were. Operationalizing a text difference that reflects a theoretical difference between Black writing and White writing also requires lots of trial-and-error programming, but it doesn't (from my perspective) abdicate to the computer to come up with the concepts we need to use.