HTRC provides access to data from the HathiTrust corpus in several forms across its suite of tools and services for computational text analysis. Data is periodically synced from HathiTrust, but not all HTRC tools and services are updated on the same schedule. Additionally, copyright and security concerns impact data availability and format.
The following table summarizes how textual data can be accessed within the HTRC ecosystem, including the availability and format of the data for each tool or service.
|Tool or service||# volumes (as of 3/18)||Rights status||Data access mechanism||File format||Data format||Permissions required|
|HTRC Analytics algorithms||5.9 million||Public domain||Via HTRC Workset; researcher runs tool without accessing underlying data||(no file access)||Uncorrected OCR text data||HTRC Analytics account|
|HT+Bookworm tool||13.7 million||All||Via web-interface; researcher visualizes data without accessing underlying data||(no file access)||Unigrams (single words), based on HTRC Extracted Features dataset||(none)|
|HTRC Data Capsule||5.9 million||Public domain||HTRC Data API||Zipped text files in PairTree directory structure||Uncorrected OCR text data||HTRC Analytics account|
|HTRC Extracted Features dataset||15.7 million||All||rsync||JSON files in PairTree directory structure||Volume- and page-level metadata and part-of-speech tagged page-level "bags of words"||(none)|
|HathiTrust custom dataset||6 million, dependent on institutional agreements||Public domain; accessibility of Google-digitized volumes dependent on whether researcher's home institution has signed agreement||rsync||Zipped text files in PairTree directory structure||Uncorrected OCR text data||Custom dataset request application|
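
Several rows in the table deliver files in a PairTree directory structure, so it may help to see how a volume identifier could map to a file path. The following is a minimal sketch, not HTRC's own code. It assumes the identifier-cleaning rules of the Pairtree specification, a `<prefix>/pairtree_root/...` layout, and a `.zip` file extension. The function names `clean_id` and `pairtree_path` and the example identifiers are hypothetical.

```python
from pathlib import PurePosixPath

# Characters the Pairtree spec hex-encodes as ^hh before the one-to-one swaps.
_HEX_ENCODE = set('"*+,<=>?\\^|')
# One-to-one swaps applied after hex-encoding.
_SWAPS = str.maketrans({"/": "=", ":": "+", ".": ","})


def clean_id(local_id: str) -> str:
    """Make an identifier filesystem-safe using Pairtree-style cleaning."""
    out = []
    for byte in local_id.encode("utf-8"):
        ch = chr(byte)
        # Hex-encode bytes outside visible ASCII and the reserved characters.
        if byte < 0x21 or byte > 0x7E or ch in _HEX_ENCODE:
            out.append(f"^{byte:02x}")
        else:
            out.append(ch)
    return "".join(out).translate(_SWAPS)


def pairtree_path(volume_id: str, extension: str = ".zip") -> PurePosixPath:
    """Build a relative path for a volume ID of the form '<prefix>.<local id>'.

    Assumed layout (not confirmed by this page):
    <prefix>/pairtree_root/<2-char segments>/<cleaned id>/<cleaned id><extension>
    """
    prefix, local_id = volume_id.split(".", 1)
    cleaned = clean_id(local_id)
    # Split the cleaned ID into two-character directory names.
    pairs = [cleaned[i:i + 2] for i in range(0, len(cleaned), 2)]
    return PurePosixPath(prefix, "pairtree_root", *pairs, cleaned, cleaned + extension)


if __name__ == "__main__":
    # Hypothetical volume ID, used only to illustrate the path shape.
    print(pairtree_path("mdp.39015012345678"))
```

The `extension` parameter is left configurable because the file type differs between services in the table (zipped text versus JSON); check the documentation of the specific dataset for the actual file names before relying on this layout.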