Towards Cultural-Scale Models of Full-Text project

This project deploys improved infrastructure for robust corpus building and modeling within the HTRC Data Capsule framework, supporting research questions that require large-scale computational experiments on the HathiTrust Digital Library (HTDL). These questions depend on the capacity to randomly sample full-text data in order to train semantic models on large worksets extracted from the HTDL. The project prototypes a system for testing and visualizing topic models built from worksets selected according to the Library of Congress Subject Headings (LCSH) hierarchy.

The project report can be found at


Colin Allen, Jaimie Murdock (Indiana University)

Jiaan Zeng (HTRC)


Large-scale digital libraries, such as the HathiTrust, open a window onto a far greater quantity of textual data than was previously available (Michel2011). These data raise new challenges for analysis and interpretation. Because works in digital libraries are constantly added and revised, any study that aims to characterize the evolution of culture using such libraries must account for the implications of corpus sampling. Cultural-scale models of full-text documents are prone to over-interpretation in the form of unintentionally strong socio-linguistic claims.

One methodology with rapid uptake in the study of cultural evolution is probabilistic topic modeling (Blei2012). Researchers need confidence in the sampling methods used to construct topic models intended to represent very large portions of the HathiTrust collection. For example, topic modeling every book categorized under the Library of Congress Classification Outline (LCCO) as "Philosophy" (call numbers B1-5802) is impractical, as any library's holdings will be incomplete. However, if it can be shown that models built from different random samples are highly similar to one another, then building a topic model that is sufficiently representative of the entire HathiTrust collection may become tractable.
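The sampling argument above can be illustrated with a minimal sketch: draw two random samples from a toy corpus and measure how similar their word distributions are. Jensen-Shannon divergence stands in here for a full topic-model comparison, and the toy documents and function names are illustrative assumptions, not the project's actual pipeline or data.

```python
import math
import random
from collections import Counter

def word_distribution(docs):
    """Unigram probability distribution over a list of tokenized documents."""
    counts = Counter(w for doc in docs for w in doc)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def jensen_shannon(p, q):
    """Jensen-Shannon divergence (base 2): 0 for identical distributions,
    1 for distributions with disjoint vocabularies."""
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in set(p) | set(q)}
    def kl(a):  # KL(a || m), restricted to words a actually contains
        return sum(pw * math.log2(pw / m[w]) for w, pw in a.items() if pw > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

# Toy "library": each document is a token list; a real experiment would
# instead draw volumes from an HTDL workset (hypothetical stand-in data).
corpus = [["mind", "body", "dualism"], ["ethics", "virtue", "mind"],
          ["logic", "truth", "proof"], ["mind", "language", "meaning"],
          ["ethics", "duty", "reason"], ["perception", "mind", "body"]]

rng = random.Random(42)
sample_a = rng.sample(corpus, 4)
sample_b = rng.sample(corpus, 4)
divergence = jensen_shannon(word_distribution(sample_a),
                            word_distribution(sample_b))
# Low divergence suggests the two samples are interchangeable for modeling.
print(f"JSD between samples: {divergence:.3f}")
```

In the full experiment the comparison would be between fitted topic models rather than raw word distributions, but the logic is the same: if repeated random samples yield near-identical models, a sample-based model can stand in for the whole collection.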



