
HTRC provides access to data from the HathiTrust corpus in several forms across its suite of tools and services for computational text analysis. Data is periodically synced from HathiTrust, but not all HTRC tools and services are updated on the same schedule. Additionally, copyright, user agreements, and security concerns impact data availability and format. 

  • HTRC algorithms and HTRC Data Capsules can analyze the entire HathiTrust corpus and also make use of each volume’s MARC bibliographic and METS metadata. Both the HTRC algorithms and the Capsule environments draw from the HTRC Data API described below.

  • HTRC also makes two datasets available: the HTRC Extracted Features Dataset and a dataset of Word Frequencies in English Language Literature, 1700-1922. HTRC Extracted Features includes metadata and extracted page-level data (words and word counts) for 13.7 million volumes.

  • HathiTrust+Bookworm visualizes data for 15.7 million volumes.
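As an illustration of the page-level word counts described for HTRC Extracted Features, the following minimal sketch sums token counts across the pages of one volume. The field names (`features`, `pages`, `body`, `tokenPosCount`) are assumptions modeled on the dataset's JSON layout and should be checked against the files actually downloaded; the `sample` record is hand-made for illustration and is not real HathiTrust data.

```python
from collections import Counter


def volume_token_counts(volume: dict, section: str = "body") -> Counter:
    """Sum per-page token counts into one volume-level bag of words.

    Assumes each page stores counts as {token: {pos_tag: count}} under
    page[section]["tokenPosCount"] (an assumed layout, see lead-in).
    """
    totals = Counter()
    for page in volume.get("features", {}).get("pages", []):
        token_pos = (page.get(section) or {}).get("tokenPosCount") or {}
        for token, pos_counts in token_pos.items():
            # Collapse the part-of-speech breakdown into a single count.
            totals[token] += sum(pos_counts.values())
    return totals


# Hand-made sample record (hypothetical, for illustration only).
sample = {
    "id": "example.0001",
    "features": {
        "pages": [
            {"seq": "00000001",
             "body": {"tokenPosCount": {"the": {"DT": 2}, "whale": {"NN": 1}}}},
            {"seq": "00000002",
             "body": {"tokenPosCount": {"whale": {"NN": 2},
                                        "swims": {"VBZ": 1, "NNS": 1}}}},
        ]
    },
}

counts = volume_token_counts(sample)
```

Because only counts are stored, word order within a page cannot be recovered from data of this shape, which is what allows it to be distributed without exposing the underlying text.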

HathiTrust text data

Access options

The following table disambiguates textual data access, including availability and format, within the HTRC ecosystem. 

| Tool or service | # volumes (as of 3/18) | Rights status | Data access mechanism | File format | Data format | Permissions required |
|---|---|---|---|---|---|---|
| HTRC Analytics algorithms | 5.9 million | All | Via HTRC Workset; researcher runs tool without accessing underlying data | (no file access) | Uncorrected OCR text data; only results are exposed to researcher (not the underlying data) | HTRC Analytics account |
| HT+Bookworm tool | 13.7 million | All | Via web interface; researcher visualizes data without accessing underlying data | (no file access) | Unigrams (single words), based on HTRC Extracted Features dataset; underlying data not exposed to researcher | (none) |
| HTRC Data Capsule | 5.9 million | Public domain for everyone; in-copyright limited to affiliates of HT member institutions upon approval of request | HTRC Data API | Zipped text files in PairTree directory structure | Uncorrected OCR text data | HTRC Analytics account |
| HTRC Extracted Features dataset | 15.7 million | All | rsync | JSON files in PairTree directory structure | Volume- and page-level metadata and part-of-speech tagged page-level "bags of words" | (none) |
| HathiTrust custom dataset | 6 million, dependent on institutional agreements | Public domain; accessibility of Google-digitized volumes dependent on whether researcher's home institution has signed agreement | rsync | Zipped text files in PairTree directory structure | Uncorrected OCR text data | Custom dataset request application |
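Several of the options above deliver files in a PairTree directory structure. The sketch below shows how a HathiTrust volume identifier could be mapped to a PairTree directory, following the character-cleaning rules of the Pairtree specification. The function names are hypothetical, and the `pairtree_root` layout under the identifier's prefix is an assumption about how the files are arranged; the name of the file inside the final directory varies by dataset and is not modeled here.

```python
def pairtree_clean(identifier: str) -> str:
    """Apply the two-step Pairtree character cleaning to an identifier."""
    out = []
    # Step 1: hex-encode bytes that are unsafe in file names.
    for byte in identifier.encode("utf-8"):
        ch = chr(byte)
        if byte < 0x21 or byte > 0x7E or ch in '"*+,<=>?\\^|':
            out.append("^%02x" % byte)
        else:
            out.append(ch)
    # Step 2: single-character substitutions for common ID punctuation.
    return "".join(out).translate(str.maketrans("/:.", "=+,"))


def pairtree_path(htid: str) -> str:
    """Map an ID such as 'mdp.39015078560078' to its PairTree directory.

    Assumes the part before the first '.' is the source prefix and the
    remainder is the local identifier.
    """
    prefix, _, local_id = htid.partition(".")
    cleaned = pairtree_clean(local_id)
    # Split the cleaned identifier into two-character path segments.
    pairs = [cleaned[i:i + 2] for i in range(0, len(cleaned), 2)]
    return "/".join([prefix, "pairtree_root", *pairs, cleaned])


path_simple = pairtree_path("mdp.39015078560078")
path_ark = pairtree_path("uc2.ark:/13960/t4qj7sq1v")
```

Identifiers containing `:` and `/`, such as ARK-style IDs, are the main reason the cleaning step matters: without it the identifier itself would introduce extra directory levels.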

Building a workset/dataset of HathiTrust text data


This table outlines the differences between the HTRC Data API and HathiTrust Data API. As a researcher-accessible service, HTRC Data API functions within the HTRC Data Capsules environment.



| | HTRC Data API | HathiTrust Data API |
|---|---|---|
| Purpose | To provide access to HT text data within an HTRC Data Capsule AND to serve high-performance, large-scale algorithms and programs (not user-accessible) | To provide public users some volume retrieval capabilities |
| Use | In-capsule via the HTRC Workset Toolkit | Get more information from HathiTrust |
| Throttling enforcement | No | Yes |
| Bulk retrieval of volumes | Yes | No |
| Metadata available | METS | METS, MARC |