Extracted Features [v.2.0]

*** Dataset release coming soon! ***

HTRC Extracted Features 2.0 is the most current version of a derived dataset consisting of metadata and data elements extracted from volumes in the HathiTrust Digital Library. The dataset is composed of 17.1 million JSON files, one per volume, representing a snapshot of the HathiTrust corpus from February 2020.

This documentation describes the structure and data in the HTRC Extracted Features 2.0 files for users of those files. The specific features extracted are described in more detail below.

You can also refer to the technical documentation of the Extracted Features 2.0 JSON-LD schema. The schema was developed collaboratively with JSTOR-Portico and is designed as a linked data standard (JSON-LD), so it can also be applied to data from non-HathiTrust sources to create compatible datasets.

Downloading the files

Coming soon!



Data Stats

# of volumes represented: 17,123,746
# of pages represented: 6,221,631,336
# of tokens represented: 2,906,819,723,689
# files derived from in-copyright volumes: 10,550,952
# pages derived from in-copyright volumes: 2,913,069,029,723
# tokens derived from in-copyright volumes: 5,826,138,059,446
# files derived from public domain & Creative Commons volumes: 6,572,794
# pages derived from public domain & Creative Commons volumes: 2,478,069,869
# tokens derived from public domain & Creative Commons volumes: 1,197,647,030,798
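
As a quick sanity check on the file counts, the short Python sketch below confirms that the in-copyright and public domain & Creative Commons file counts add up to the total number of volumes, and computes each category's share. It assumes one JSON file per volume; the constant names and the share() helper are illustrative only and are not part of the dataset or any HTRC tooling.

```python
# File and volume counts copied from the Data Stats figures.
# Assumption: each volume is represented by exactly one JSON file.
TOTAL_VOLUMES = 17_123_746
IN_COPYRIGHT_FILES = 10_550_952
PUBLIC_DOMAIN_CC_FILES = 6_572_794


def share(part: int, whole: int) -> float:
    """Return part as a percentage of whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# The two rights categories should account for every volume.
files_sum = IN_COPYRIGHT_FILES + PUBLIC_DOMAIN_CC_FILES
counts_match = files_sum == TOTAL_VOLUMES

in_copyright_share = share(IN_COPYRIGHT_FILES, TOTAL_VOLUMES)
public_domain_cc_share = share(PUBLIC_DOMAIN_CC_FILES, TOTAL_VOLUMES)

print(f"File counts sum to total volumes: {counts_match}")
print(f"In-copyright share of files: {in_copyright_share}%")
print(f"Public domain & Creative Commons share of files: {public_domain_cc_share}%")
```

By this calculation, roughly three in five files are derived from in-copyright volumes.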