
The HathiTrust Research Center (HTRC) releases research datasets to facilitate text analysis using the HathiTrust Digital Library (HTDL). While copyright-protected texts are not available for download from the HathiTrust, fruitful research can still be performed through non-consumptive analysis of features extracted from the full text. To this end, the HTRC makes available page-level features extracted from volumes in the HTDL. These features include volume-level metadata, page-level metadata, part-of-speech-tagged tokens, and token counts. Additionally, the HTRC has partnered with advanced researchers to release a derived dataset, Word Frequencies in English-Language Literature, 1700-1922.

Getting Started

Downloading Extracted Features

Use Cases and Examples

Extracted Features in the Wild

Extracted Features Dataset 


Get the data

Word Frequencies in English-Language Literature, 1700-1922 


Get the data

Extracted Features Dataset [v.0.2]

NOTE: This dataset has been superseded by the Extracted Features Dataset [v.1.0], listed above.


Get the data

