The HTRC Extracted Features dataset consists of metadata and derived data elements extracted from volumes in the HathiTrust Digital Library. The dataset is updated periodically, both to add new volumes and to adjust the file schema, and each update is released as a new version. The current version is 2.0.
A great deal of useful research can be performed with features extracted from full-text volumes. For this reason, we generate and share the HTRC Extracted Features dataset. Each Extracted Features file corresponds to one volume from the HathiTrust Digital Library and is in JSON-LD format.
An Extracted Features file has two main parts:
Metadata: Each file begins with bibliographic and other metadata describing the volume that the file represents.
Features: These are notable or informative characteristics of the text. The features include:
Within each Extracted Features file, features are provided per page, which makes it possible to separate text from paratext. For instance, page-level feature information could help identify publishers' ads at the back of a book.
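As a rough illustration of why per-page features are useful, the following Python sketch sums token counts across a volume while skipping pages judged to be paratext. The sample record is a hypothetical, heavily simplified stand-in for an Extracted Features file. The field names used here (`features`, `pages`, `seq`, `body`, `tokenPosCount`) and the helper functions are assumptions made for the example, and should be checked against the schema documentation for the version you download.

```python
import json
from collections import Counter

# Hypothetical, heavily simplified stand-in for one Extracted Features file.
# Field names are assumptions for illustration; verify them against the schema.
sample_json = """
{
  "metadata": {"title": "An Example Volume"},
  "features": {
    "pages": [
      {"seq": "00000001",
       "body": {"tokenPosCount": {"whale": {"NN": 3}, "the": {"DT": 5}}}},
      {"seq": "00000002",
       "body": {"tokenPosCount": {"sea": {"NN": 2}, "the": {"DT": 4}}}},
      {"seq": "00000003",
       "body": {"tokenPosCount": {"price": {"NN": 2}, "catalogue": {"NN": 1}}}}
    ]
  }
}
"""


def page_token_counts(page):
    """Collapse part-of-speech counts into one token count per word for a page."""
    counts = Counter()
    body = page.get("body") or {}
    for token, pos_counts in (body.get("tokenPosCount") or {}).items():
        counts[token] += sum(pos_counts.values())
    return counts


def volume_counts(volume, skip_seqs=frozenset()):
    """Sum page-level counts, leaving out pages whose sequence id is in skip_seqs."""
    total = Counter()
    for page in volume["features"]["pages"]:
        if page["seq"] in skip_seqs:
            continue
        total += page_token_counts(page)
    return total


volume = json.loads(sample_json)
all_pages = volume_counts(volume)
# Suppose the last page was identified as a publisher's ad and should be excluded.
text_only = volume_counts(volume, skip_seqs={"00000003"})
```

In this sketch the pages to skip are supplied by hand. In practice they would come from whatever paratext-detection method a researcher chooses to apply to the page-level features.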
HTRC has partnered with researchers to create other derived datasets from the HathiTrust corpus. Follow the links below to learn more and access the data.
This dataset contains the word frequencies for all English-language volumes of fiction, drama, and poetry in the HathiTrust Digital Library from 1700 to 1922. Word counts are aggregated at the volume level, but include only pages tagged as belonging to the relevant literary genre. Fiction was identified using a mixed approach of metadata and predictive modeling based on human-assigned ground truth. A full explanation of the dataset's features, motivation, and creation is available on the dataset documentation page below.
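The idea of volume-level counts built only from genre-tagged pages can be sketched in Python as follows. This is a minimal illustration, not the method used to build the dataset. The page records, the genre labels, and the `aggregate_by_genre` helper are all hypothetical.

```python
from collections import Counter

# Hypothetical page records for one volume: (genre tag, word counts on that page).
pages = [
    ("front matter", {"published": 1, "london": 1}),
    ("fiction", {"she": 4, "said": 2}),
    ("fiction", {"she": 1, "door": 3}),
    ("ads", {"price": 2}),
]


def aggregate_by_genre(pages, genre):
    """Return volume-level word counts using only pages tagged with `genre`."""
    total = Counter()
    for tag, counts in pages:
        if tag == genre:
            total.update(counts)
    return total


fiction_counts = aggregate_by_genre(pages, "fiction")
```

Here words from the front matter and the ads do not contribute to the volume's counts, which mirrors the filtering described above.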