Advanced Collaborative Support (ACS) is a scholarly service at HTRC offering collaboration between external scholars and HTRC staff to solve challenging problems related to HTRC tools and services. By working together with scholars, we facilitate computational access to HathiTrust Research Center (HTRC) digital tools as well as the HathiTrust Digital Library (HTDL) based on individual scholarly need. ACS will drive innovation at the scholar's digital workbench by enhancing existing techniques and developing new ones for use within the HTRC platform.
Calls for proposals to participate in the ACS program go out approximately once per year. For questions, please send an email to email@example.com.
Laure Thompson and David Mimno (Cornell)
This project will develop methods for automatically constructing large-scale collections of genre fiction from HathiTrust. Even, and especially, in digital libraries as large as HathiTrust, it can prove challenging to understand whether the library contains suitable representations of a chosen genre. The researchers plan to focus on collections of speculative fiction novels as a case study, but they intend their solutions to be generalizable. They will identify robust methods for correlating author-title pairs to matching volume sets in HathiTrust. Using these methods in conjunction with lists of novels that were curated by hand, they will build their collections and investigate which works are (over)represented and which are missing. They expect their project will enable scholars to better understand the suitability of studying genre fiction through HathiTrust and highlight underserved author and genre groups. Moreover, the project will result in collections of genre fiction which can be readily reused and reorganized for different lines of humanistic inquiry.
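The matching step described above can be illustrated with a small sketch. It is not the researchers' method: it assumes that catalog records are plain dictionaries with hypothetical `id`, `author`, and `title` fields, and it swaps in simple string similarity from Python's standard `difflib` module with an arbitrary threshold.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def similarity(a, b):
    """Similarity ratio in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_volumes(author, title, records, threshold=0.85):
    """Return ids of records whose author and title both resemble the query."""
    matches = []
    for rec in records:
        if (similarity(author, rec["author"]) >= threshold
                and similarity(title, rec["title"]) >= threshold):
            matches.append(rec["id"])
    return matches


# Hypothetical records; the ids are invented for illustration.
records = [
    {"id": "vol-001", "author": "Le Guin, Ursula K.", "title": "The Left Hand of Darkness"},
    {"id": "vol-002", "author": "Le Guin, Ursula K", "title": "The left hand of darkness."},
    {"id": "vol-003", "author": "Herbert, Frank", "title": "Dune"},
]
print(match_volumes("Le Guin, Ursula K.", "The Left Hand of Darkness", records))
```

Here the two differently punctuated records for the same novel are grouped into one volume set, which is the kind of output a hand-curated list of novels could be checked against.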
Matthew J. Yoder and Dmitry Mozzherin (University of Illinois)
This project will create an index of all the scientific names of the Earth’s species found within the HathiTrust corpus. The index, which will likely measure in the hundreds of millions to billions of entries, will consist of a simple link between the scientific name and the volume and page location of that name within HathiTrust. The index will assist in identifying volumes that may be medically relevant, for example by identifying all of the volumes containing the scientific name for the mosquito that carries illnesses such as Zika virus (‘Aedes aegypti’). The index will also allow volumes to be grouped into clusters based on which scientific names they contain to show which taxa (e.g. “mammals”) are most common. This team of researchers has completed similar work across the data of the Biodiversity Heritage Library. Their ACS project will allow them to do cross-corpora comparisons.
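The "simple link" between a name and its volume and page locations is essentially an inverted index. A minimal sketch follows, assuming that name detection has already produced `(name, volume_id, page)` tuples; the volume ids and page numbers are invented, and the detection step itself is not shown.

```python
from collections import defaultdict


def build_index(occurrences):
    """Map each scientific name to the list of (volume_id, page) locations."""
    index = defaultdict(list)
    for name, volume_id, page in occurrences:
        index[name].append((volume_id, page))
    return index


def volumes_by_name(index):
    """Collapse page-level locations to the set of volumes containing each name."""
    return {name: {vol for vol, _ in locs} for name, locs in index.items()}


# Hypothetical occurrences for illustration only.
occurrences = [
    ("Aedes aegypti", "vol-a", 12),
    ("Aedes aegypti", "vol-b", 301),
    ("Homo sapiens", "vol-a", 40),
]
index = build_index(occurrences)
print(index["Aedes aegypti"])
print(sorted(volumes_by_name(index)["Aedes aegypti"]))
```

The per-name volume sets are also a natural starting point for grouping volumes by the names they share.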
Dan Sinykin (Emory)
This project furthers the researcher’s investigation into how the conglomeration of the publishing industry changed literature. The results will be included in the researcher’s in-progress book titled The Conglomerate Era: A Computational History of Literature in the Age of the Agent. The project explores a set of publisher-based corpora to see whether there are distinctions in what is published by large publishing houses versus independent presses. It will make use of predictive modeling to further the researcher’s existing work to build a computational model of genre that aids in identifying latent patterns in the publishers’ editorial practices. The project will utilize methods such as genre detection through unsupervised modeling; stylistic differentiation through text classification and supervised learning via logistic regression; and social network analysis with metadata to determine latent literary connections, especially with regard to gender and race of the author.
Stephen Krewson (Yale)
This project aims to identify all pictorial elements in educational texts from 1800 to 1850 to explore the interplay between progressive education and print media in the early nineteenth century. The resulting research will characterize the extent to which wood engravings and other reprographic materials were shared among educational publishers. The researcher will extract specific features from page images, such as illustration location, using advances in machine learning. The project intends to use the process developed for identifying pictorial elements to motivate a new metadata field that describes the location and type of illustrations on the page. An ultimate goal of the project is to move toward “machine-read” texts where the data generated by classifiers and dimensionality reduction techniques are bundled as metadata with the corresponding volumes and made available to future research. (“Machine-read” is a term borrowed from researcher Ben Schmidt.)
Molly Des Jardin, Scott Enderle, and Katie Rawson (University of Pennsylvania)
This project intends to explore a novel way of abstracting and representing textual data that could aid in new ways of discovering and deduplicating items in HathiTrust, detecting and analyzing genre, or analyzing narrative analogies. The project team will investigate the utility of a certain kind of mathematical representation of text documents, called semantic phasor embeddings, which combines a mathematical structure called phasors with data from standard word embeddings (strings of numbers that represent an item). If successful, the vectors could represent documents with a tunable degree of granularity, which could provide an opportunity to share vectors representing copyright-protected works without concerns about wholesale text reproduction. The vectors would also carry valuable information about the global ordinal structure of the volumes, so that the items could be queried, clustered, and visualized in a robust way that recognizes similarity not just in the content of the items, but also in their structure.
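One way to picture how phasors and word embeddings might be combined is sketched below. It is an illustration under stated assumptions, not the project's implementation: it assumes each word vector is weighted by a complex phasor that rotates with the word's position in the document and that the weighted vectors are averaged, which amounts to keeping the lowest-frequency Fourier coefficients of the sequence of word vectors. The function name `phasor_embedding` and the parameter `num_frequencies` are hypothetical.

```python
import cmath


def phasor_embedding(word_vectors, num_frequencies):
    """Return num_frequencies complex vectors summarizing a document.

    word_vectors: one embedding (list of floats) per word, in reading order.
    Frequency 0 is the plain average; higher frequencies add positional detail.
    """
    n = len(word_vectors)
    dim = len(word_vectors[0])
    embedding = []
    for k in range(num_frequencies):
        component = [0j] * dim
        for t, vec in enumerate(word_vectors):
            # Phasor whose angle depends on the word's position t.
            phasor = cmath.exp(-2j * cmath.pi * k * t / n)
            for d in range(dim):
                component[d] += phasor * vec[d]
        embedding.append([c / n for c in component])
    return embedding


# Two toy "documents" with the same words in opposite order.
doc_a = [[1.0, 0.0], [0.0, 1.0]]
doc_b = [[0.0, 1.0], [1.0, 0.0]]
emb_a = phasor_embedding(doc_a, num_frequencies=2)
emb_b = phasor_embedding(doc_b, num_frequencies=2)
print(emb_a[0], emb_b[0])  # frequency 0: same content average
print(emb_a[1], emb_b[1])  # frequency 1: differs with word order
```

In this sketch `num_frequencies` plays the role of the tunable granularity: a small value keeps only coarse, order-aware structure, from which the original word sequence cannot be read back directly.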
Robin Burke, John Shanahan, Ana Lucic, DePaul University
The Reading Chicago Reading team will seek to extend their own research on the “One Book, One Chicago” city-wide reading program by incorporating textual analysis on books chosen for the OBOC program, as well as comparison texts. Further, the resulting textual analysis—including toponym extraction, sentiment analysis, and story arc detection—will be paired with library patron, circulation and demographic data to present a fuller picture about the OBOC program, and the books chosen for inclusion.
Project report: Computational Support for ‘Reading Chicago Reading’
David Bamman and Bjorn Hartmann, University of California, Berkeley
This project will utilize the HTRC Data Capsule to conduct feature extraction on page images from 10,000 in-copyright books in the HathiTrust repository, extracting features such as page construction, line justification, leading between baselines, kerning between letter pairs/combinations, line density per page, characters per line, position of images, typeface (serif, sans-serif) and font size. Beyond the analysis and utility of the extracted feature set, this project also seeks to serve as a use case for engagement with HathiTrust/HTRC beyond books-as-strings-of-words analysis.
Project report: Modeling the History of Book Design, HTRC Whitepaper: Summary of Activities
Laura Nelson, Northeastern University
Dr. Nelson’s project will study the women's movement in the United States from 1848-1975 in two cities, New York City and Chicago, using new advances in network analysis and computational text analysis to identify structural and cultural diversity. This approach is three-pronged: building a workset of writing by individuals and organizations within the movements in New York and Chicago, using network analysis to measure the structure of this movement, and conducting computational text analysis to measure the underlying culture and ideas within the movement, including lexical analyses to identify distinctive words and topic modeling to identify dominant themes.
Project report: The Power of Place: Structure, Culture, and Continuity in U.S. Women's Movements
Richard Jean So, McGill University
Dr. So’s project seeks to write a new history of the American novel by examining a series of large textual datasets focused on the full cycle of the U.S. literary field from production to reception to canonization. The major goal is to identify the emergence of new patterns of language, style, discourse and themes in American novels as they appear at different moments in the cycle of literary production and reception, including publication via large publishing houses such as Random House, and book reviews in major U.S. periodicals. This will be achieved through using the HTRC Data Capsule environment to undertake text analysis of full texts, including using various methods in Machine Learning and Natural Language Processing, such as topic models, word embeddings, and specialized tools such as BookNLP, which allows for the extraction of grammatical dependencies and characters.
Project report: A Computational History of the U.S. Novel, 1950-2000
Laura McGrath, Devin Higgins, Arend Hintze, Michigan State University
This work draws on ongoing collaborative efforts to develop a method for applying genetic sequencing tools to the study of literature in order to identify and measure literary novelty, and address questions of literary history, canonicity, and prestige. Previous results have been suggestive of a prominent connection between the purely information-based novelty of the sequences of characters that comprise literary texts, and the experimental newness we associate with modernist literary texts. Leveraging the HTRC Data Capsule will offer the potential to apply this theory at scale for the first time, and potentially lead into new research into modernism and the literary history of the 20th century.
Nicholas Kelly, Loren Glass, Nikki White, University of Iowa
The PEP team will compile a proof-of-concept workset with, at first, prominent individuals (faculty, staff, students) who were involved with the Iowa Writers’ Workshop (IWW), then produce “style cards” for each author’s works (by volume), based on stylometric data gathered through text analysis of the IWW workset within the HTRC Data Capsule. It is the goal of the project to also create a living workset that can be continually updated for scholars who wish to engage with IWW authors and their writing.
Project report: Program Era Project
The Life of Words
David-Antoine Williams, The University of Waterloo
This project aimed to match Oxford English Dictionary (OED) references to HathiTrust volume IDs, in order then to draw down associated metadata using the heterogeneous and fragmented bibliographical data in OED2 and OED3. It furthered the work of The Life of Words (LOW), a research project in its third year, led by Dr. Williams at St Jerome’s University in the University of Waterloo in Canada. The aim of the project is to enhance the OED with metadata concerning its corpus of 3.5 million quotations.
Project report: The Life of Words
Mariola Espinosa, University of Iowa
This project seeks to explore the history of yellow fever in the Caribbean by comparing how the disease was described by residents of the Caribbean to the European perspective, including through sentiment analysis of text referencing yellow fever. Her work will be visualized spatially in a map generated with support from the University of Iowa’s Digital Scholarship and Publishing Studio. She will build a corpus of text from the HathiTrust Digital Library related to yellow fever and filth in the Caribbean to track the use of the terms “filth” and “filthiness” from 1650 to 1902.
Project report: Fighting Fever in the Caribbean
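The term tracking described for this project can be pictured with a short sketch. It assumes a hypothetical corpus of `(year, text)` pairs and simple lowercase tokenization; the sample sentences are invented and do not come from HathiTrust.

```python
import re
from collections import Counter

TERMS = ("filth", "filthiness")


def term_counts_by_year(documents, start=1650, end=1902):
    """Count occurrences of each tracked term per publication year."""
    counts = {term: Counter() for term in TERMS}
    for year, text in documents:
        if not start <= year <= end:
            continue  # outside the period of interest
        for token in re.findall(r"[a-z]+", text.lower()):
            if token in counts:
                counts[token][year] += 1
    return counts


# Invented sample documents for illustration only.
documents = [
    (1793, "Filth in the streets; the filth spread."),
    (1850, "Filthiness was blamed."),
    (1950, "filth"),
]
counts = term_counts_by_year(documents)
print(counts["filth"][1793], counts["filthiness"][1850])
```

Dividing each yearly count by the total number of tokens for that year would give relative frequencies, which are usually easier to compare across periods with very different amounts of text.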
Samuel Franklin, Brown University
This project will map the increasing use and shifting meanings of the words “creative” and “creativity,” with a particular focus on the twentieth century. A custom “creativity corpus” will be assembled and processed to identify linguistic patterns via a number of text analysis and natural language processing techniques. Franklin’s project will make use of the functionality developed for HathiTrust + Bookworm.
Project report: Inside the Creativity Boom
Dan Baciu, Illinois Institute of Technology
This project will look at the Chicago School of architecture and examine its history of reception over the last 75 years, as well as identify patterns in its international spread and influence. Baciu will use named entity recognition in his analysis, notably deploying the Wikifier tool on a large sample corpus of HathiTrust data for the first time.
Project report: The Chicago School: Evolving Systems of Value
Dallas Liddle, Augsburg College
This project will test two hypotheses about information theory and information density as they relate to a digital humanities approach to studying Romantic-era British fiction. The concept of "information" used in mathematical information theory may help digital humanists evaluate the information density of textual forms. This project tests a theory that the popular and critical success of the novel in British print culture after 1815 may be related to increased information density and sophistication of information encoding in those years, especially via innovations introduced by Jane Austen and Walter Scott.
Project report: Signal and Noise and Pride and Prejudice
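As a point of reference for the notion of "information" used in this project, the simplest measure from information theory is Shannon entropy. The sketch below computes character-level entropy in bits per character; it is only an assumed stand-in, since the abstract does not say which density measure the project uses.

```python
import math
from collections import Counter


def entropy_bits_per_char(text):
    """Shannon entropy of the character distribution, in bits per character."""
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


print(entropy_bits_per_char("aaaa"))  # one symbol only: no uncertainty
print(entropy_bits_per_char("abab"))  # two equally likely symbols
print(entropy_bits_per_char("abcd"))  # four equally likely symbols
```

A measure like this ignores word order and context, so any serious comparison of textual forms would need a richer model than single-character frequencies.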
Geoffrey Rockwell, Laura Mandell, Stefan Sinclair, Matthew Wilkens, Susan Brown, University of Alberta, Texas A&M University, University of Notre Dame
Rockwell, Mandell, Sinclair, Wilkens, and Brown aim to extract theoretical subsets from the HT public corpus and apply large-scale topic modeling to those subsets. The researchers will develop tools and computational methods for tracking the concept of “theory”.
Project report: The Trace of Theory project
Detecting Literary Plagiarisms: The Case of Oliver Goldsmith
Douglas Duhaime, University of Notre Dame
Duhaime will work on developing tools for detecting plagiarisms. He will focus on the case of Oliver Goldsmith, to detect the literary thefts of Goldsmith by using machine learning techniques.
Taxonomizing the Texts: Towards Cultural-Scale Models of Full Text
Colin Allen, Jaimie Murdock, Indiana University Bloomington
Allen and Murdock will carry out a cultural-scale investigation and topic modeling on HT public-domain full text through random sampling to select collections according to the Library of Congress Subject Headings (LCSH).
Project report: Towards Cultural-Scale Models of Full-Text project