
HTRC provides access to data from the HathiTrust corpus in several forms across its suite of tools and services for computational text analysis. Data is periodically synced from HathiTrust, but not all HTRC tools and services are updated on the same schedule. Copyright and security concerns also affect data availability and format.

Statistics about the HTRC collection and about the HathiTrust corpus may also help in understanding HathiTrust data.

The following table summarizes the options for accessing textual data within the HTRC ecosystem, including availability and format.

| Tool or service | # volumes (as of 3/18) | Rights status | Data access mechanism | File format | Data format | Permissions required |
|---|---|---|---|---|---|---|
| HTRC Analytics algorithms | 5.9 million | Public domain | Via HTRC Workset; researcher runs tool without accessing underlying data | (no file access) | Uncorrected OCR text data | HTRC Analytics account |
| HT+Bookworm tool | 13.7 million | All | Via web-interface; researcher visualizes data without accessing underlying data | (no file access) | Unigrams (single words), based on HTRC Extracted Features dataset | (none) |
| HTRC Data Capsule | 5.9 million | Public domain | HTRC Data API | Zipped text files in PairTree directory structure | Uncorrected OCR text data | HTRC Analytics account |
| HTRC Extracted Features dataset | 15.7 million | All | rsync | JSON files in PairTree directory structure | Volume- and page-level metadata and part-of-speech tagged page-level "bags of words" | (none) |
| HathiTrust custom dataset | 6 million, dependent on institutional agreements | Public domain; accessibility of Google-digitized volumes dependent on whether researcher's home institution has signed agreement | rsync | Zipped text files in PairTree directory structure | Uncorrected OCR text data | Custom dataset request application |
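Several of the options above deliver files in a PairTree directory structure, so it can help to see how a volume identifier maps to a directory path. The following is a minimal sketch, assuming the general Pairtree identifier-cleaning rules and the common HathiTrust convention that the text before the first period in a volume ID is the library prefix. The function names and the numeric volume ID are hypothetical, chosen for illustration, and the exact layout and file names inside a given HTRC dataset may differ.

```python
import re


def pairtree_clean(identifier: str) -> str:
    """Apply the Pairtree identifier-cleaning rules to one identifier."""

    # Step 1: hex-encode characters that are unsafe in file names
    # (anything outside visible ASCII, plus a fixed set of punctuation).
    def hex_encode(match: re.Match) -> str:
        return "".join(f"^{b:02x}" for b in match.group(0).encode("utf-8"))

    encoded = re.sub(r'[^\x21-\x7e]|["*+,<=>?\\^|]', hex_encode, identifier)

    # Step 2: single-character substitutions: / -> =, : -> +, . -> ,
    return encoded.translate(str.maketrans("/:.", "=+,"))


def pairtree_path(volume_id: str, segment_length: int = 2) -> str:
    """Build the directory path for an ID of the form 'prefix.local_id'.

    Assumption: the part before the first period is the library prefix,
    and the remainder is the identifier stored in the pairtree.
    """
    prefix, _, local_id = volume_id.partition(".")
    cleaned = pairtree_clean(local_id)
    # Split the cleaned identifier into fixed-length directory segments.
    segments = [
        cleaned[i:i + segment_length]
        for i in range(0, len(cleaned), segment_length)
    ]
    return "/".join([prefix, "pairtree_root", *segments, cleaned])


# Hypothetical numeric ID, for illustration only.
print(pairtree_path("mdp.39015012345678"))
# mdp/pairtree_root/39/01/50/12/34/56/78/39015012345678

# ARK-style ID: ':' and '/' are replaced before the string is split.
print(pairtree_path("uc2.ark:/13960/t4mk66f1d"))
# uc2/pairtree_root/ar/k+/=1/39/60/=t/4m/k6/6f/1d/ark+=13960=t4mk66f1d
```

The cleaning step matters mostly for ARK-style identifiers, whose colons and slashes would otherwise be invalid or ambiguous in file paths.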
