HTRC Extracted Features
A great deal of useful research can be performed with features extracted from full-text volumes. For this reason, we generate and share a dataset called the HTRC Extracted Features. The current version of the dataset is Extracted Features 2.0. Each Extracted Features file corresponds to one volume from the HathiTrust Digital Library and is in JSON-LD format.
An Extracted Features file has two main parts, metadata and features:
Metadata: each file begins with bibliographic and other metadata describing the volume that the file represents.
Features: notable or informative characteristics of the text. The features include:
- Token (word) counts tagged with part of speech, in order to disambiguate homographs and enable a greater variety of analyses
- Line-level information, such as the number of lines with text on each page and counts of the characters that begin and end lines on each page
- Header and footer identification, for cleaner data
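As a rough illustration of how the features listed above might be read, the sketch below parses a tiny, hand-written stand-in for an Extracted Features file and totals the part-of-speech-tagged token counts found in the body of each page. The field names (`metadata`, `features`, `pages`, `seq`, `header`, `body`, `footer`, `lineCount`, `tokenPosCount`), the part-of-speech tags, and all sample values are assumptions made for illustration, and the helper `body_token_pos_counts` is hypothetical; check the names against the published Extracted Features 2.0 schema before relying on them.

```python
import json
from collections import Counter

# Hand-written stand-in for one Extracted Features file (NOT real data).
# All field names here are assumptions; verify them against the EF schema.
SAMPLE_EF = """
{
  "metadata": {"title": "Example volume"},
  "features": {
    "pages": [
      {"seq": "00000001",
       "header": {"tokenPosCount": {"CHAPTER": {"NN": 1}}},
       "body": {"lineCount": 2,
                "tokenPosCount": {"rose": {"NN": 1, "VBD": 2},
                                  "the": {"DT": 3}}},
       "footer": null},
      {"seq": "00000002",
       "header": null,
       "body": {"lineCount": 1,
                "tokenPosCount": {"rose": {"NN": 1}}},
       "footer": null}
    ]
  }
}
"""


def body_token_pos_counts(ef):
    """Sum (token, part-of-speech) counts over the body region of every page.

    Headers and footers are skipped, which is the point of having them
    identified separately.
    """
    totals = Counter()
    for page in ef["features"]["pages"]:
        body = page.get("body") or {}  # a region may be null on some pages
        for token, pos_counts in body.get("tokenPosCount", {}).items():
            for pos, count in pos_counts.items():
                totals[(token, pos)] += count
    return totals


ef = json.loads(SAMPLE_EF)
counts = body_token_pos_counts(ef)
print(ef["metadata"]["title"])
print(counts[("rose", "NN")], counts[("rose", "VBD")])
```

Keying the totals by (token, part of speech) keeps "rose" the noun apart from "rose" the verb, which is the kind of disambiguation the tagged counts are meant to support.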
Within each Extracted Features file, features are provided per page, which makes it possible to separate text from paratext. For instance, page-level feature information could aid in identifying publishers' ads at the back of a book.
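To make the idea of page-level filtering concrete, here is a minimal sketch of one possible heuristic: flag the run of pages at the end of a volume whose body text is sparse, measured as tokens per line. This is an illustrative assumption, not a method described by HTRC. The field names (`seq`, `body`, `tokenCount`, `lineCount`), the threshold of 4.0, the helper names, and the sample pages are all hypothetical, and a real ad detector would need more evidence than text density alone.

```python
def tokens_per_line(page):
    """Average body tokens per line for one page (0.0 for an empty page)."""
    body = page.get("body") or {}
    lines = body.get("lineCount", 0)
    tokens = body.get("tokenCount", 0)
    return tokens / lines if lines else 0.0


def trailing_sparse_pages(pages, threshold=4.0):
    """Return the seq values of the unbroken run of sparse pages at the end.

    Walks backward from the last page and stops at the first page whose
    density reaches the (arbitrary, illustrative) threshold.
    """
    flagged = []
    for page in reversed(pages):
        if tokens_per_line(page) >= threshold:
            break
        flagged.append(page["seq"])
    return list(reversed(flagged))


# Hand-written sample pages (NOT real data); field names are assumptions.
pages = [
    {"seq": "00000001", "body": {"tokenCount": 300, "lineCount": 30}},
    {"seq": "00000002", "body": {"tokenCount": 280, "lineCount": 28}},
    {"seq": "00000003", "body": {"tokenCount": 40, "lineCount": 20}},
    {"seq": "00000004", "body": {"tokenCount": 12, "lineCount": 6}},
]
sparse = trailing_sparse_pages(pages)
print(sparse)
```

Pages flagged this way are only candidates for paratext; they would still need to be checked against other features, such as the characters that begin and end lines.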