...

| Tool or service | # volumes available (as of 8/19) | Rights status | Data access mechanism | File format | Data format | Permissions required |
| --- | --- | --- | --- | --- | --- | --- |
| HTRC Analytics algorithms | 17 million | All | Via HTRC Workset; researcher runs tool without accessing underlying data | (no file access) | Uncorrected OCR text data; only results are exposed to researcher (not the underlying data) | HTRC Analytics account |
| HT+Bookworm tool | 15.7 million | All | Via web-interface; researcher visualizes data without accessing underlying data | (no file access) | Unigrams (single words), based on HTRC Extracted Features dataset; underlying data not exposed to researcher | (none) |
| HTRC Data Capsule | 6.5 million for everyone; 17 million for affiliates of HT member institutions upon approval or request | Public domain for everyone; In-copyright limited to affiliates of HT member institutions upon approval of request | HTRC Data API | Zipped text files in PairTree directory structure | Uncorrected OCR text data | HTRC Analytics account |
| HTRC Extracted Features dataset | 15.7 million | All | rsync | JSON files in PairTree directory structure | Volume- and page-level metadata and part-of-speech tagged page-level "bags of words" | (none) |
| HathiTrust custom dataset request | 6.5 million, dependent on institutional agreements | Public domain; accessibility of Google-digitized volumes dependent on whether researcher's home institution has signed agreement | rsync | Zipped text files in PairTree directory structure | Uncorrected OCR text data | Custom dataset request application |
| HathiTrust Data API | ~800 thousand available; practical limit for retrieval is 10 thousand volumes | Public domain volumes not digitized by Google | HathiTrust Data API | Zipped text files in PairTree directory structure | Uncorrected OCR text data and page images | Key required to use the API outside of the Web client: read the documentation |
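
Several of the options above deliver files in a PairTree directory structure. As a rough illustration of how a volume ID relates to such a path, here is a minimal Python sketch. It assumes the layout `<prefix>/pairtree_root/<two-character pairs>/<cleaned ID>` and applies only the three single-character substitutions from the Pairtree specification (`/` to `=`, `:` to `+`, `.` to `,`); the specification's hex-encoding step is left out. The function name `pairtree_path` and the sample IDs are illustrative, not part of any HTRC tool, and the exact layout of a given dataset should be checked against its own documentation.

```python
from pathlib import PurePosixPath

# Single-character substitutions from the Pairtree specification.
# Simplification (assumption): the spec's hex-encoding step for other
# unsafe characters is omitted in this sketch.
_CLEAN = str.maketrans({"/": "=", ":": "+", ".": ","})


def pairtree_path(volume_id: str) -> str:
    """Map a volume ID to an assumed PairTree directory path."""
    # The ID has the form "<source prefix>.<local identifier>";
    # split on the first dot only, since the local part may contain dots.
    prefix, local_id = volume_id.split(".", 1)
    cleaned = local_id.translate(_CLEAN)
    # Break the cleaned identifier into two-character directory names.
    pairs = [cleaned[i:i + 2] for i in range(0, len(cleaned), 2)]
    return str(PurePosixPath(prefix, "pairtree_root", *pairs, cleaned))


# Illustrative IDs only.
print(pairtree_path("mdp.39015012345678"))
# mdp/pairtree_root/39/01/50/12/34/56/78/39015012345678
print(pairtree_path("uc2.ark:/13960/t4mk66f1d"))
# uc2/pairtree_root/ar/k+/=1/39/60/=t/4m/k6/6f/1d/ark+=13960=t4mk66f1d
```

Under these assumptions, the zipped text file or JSON file for a volume would sit inside the final directory of the returned path.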

Building a workset/dataset of HathiTrust text data

...

HathiTrust and HTRC APIs

This table outlines the differences between the HTRC Data API and the HathiTrust Data API. The HTRC Data API is available to researchers only from within the HTRC Data Capsule environment.



 

| | HTRC Data API | HathiTrust Data API |
| --- | --- | --- |
| purpose | to provide access to HathiTrust text data within an HTRC Data Capsule AND to serve high-performance large-scale algorithms and programs (not publicly-accessible) | to provide public users some volume retrieval capabilities |
| data available | entire HathiTrust corpus | public domain volumes not digitized by Google |
| use | In-capsule via the HTRC Workset Toolkit | Get more information from HathiTrust |
| throttling enforcement | no | yes |
| security | JWT | OAuth |
| bulk retrieval of volumes | yes | up to 10,000 volumes |
| metadata available | METS | METS, MARC |