HTRC Derived Datasets

HTRC Derived Datasets are structured sets of metadata representing a curated collection of HathiTrust volumes. Read about the basics of our Extracted Features and partner-created datasets here.

HTRC Extracted Features

HTRC Extracted Features datasets consist of metadata and derived data elements that have been extracted from volumes in the HathiTrust Digital Library. The dataset is periodically updated, including adding new volumes and adjusting the file schema. When we update the dataset, we create a new version. The current version is v.2.0.


The basics

A great deal of useful research can be performed with features extracted from full-text volumes. For this reason, we generate and share a dataset called the HTRC Extracted Features. The current version of the dataset is Extracted Features 2.0. Each Extracted Features file corresponds to one volume from the HathiTrust Digital Library. The files are in JSON-LD format.

An Extracted Features file has two main parts:

Metadata

Each file begins with bibliographic and other metadata describing the volume represented by the Extracted Features file. 

Features

Features are notable or informative characteristics of the text. The features include:

  • Token (word) counts that have been tagged with part-of-speech in order to disambiguate homographs (words that are spelled the same but play different grammatical roles) and enable a greater variety of analyses
  • Line-level information, such as the number of lines with text on each page, and a count of the characters that start and end lines on each page
  • Header and footer identification for cleaner data

Within each Extracted Features file, features are provided per-page to make it possible to separate text from paratext. For instance, feature information could aid in identifying publishers' ads at the back of a book.
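As a rough illustration of working with per-page features, the sketch below sums part-of-speech-tagged token counts across page bodies while skipping headers and footers. The sample record and its field names ("features", "pages", "header", "body", "footer", "tokenPosCount") are assumptions made for this example; consult the Extracted Features documentation for the actual schema before adapting it.

```python
import json
from collections import Counter

# Hypothetical, heavily trimmed stand-in for one Extracted Features file.
# All field names and values here are assumptions for illustration only.
SAMPLE_EF = json.loads("""
{
  "htid": "example.0001",
  "metadata": {"title": "An Example Volume"},
  "features": {
    "pages": [
      {"seq": "00000001",
       "header": {"tokenPosCount": {"CHAPTER": {"NN": 1}}},
       "body": {"tokenPosCount": {"rose": {"NN": 2, "VBD": 1}, "the": {"DT": 4}}},
       "footer": null},
      {"seq": "00000002",
       "header": null,
       "body": {"tokenPosCount": {"rose": {"VBD": 2}, "garden": {"NN": 1}}},
       "footer": {"tokenPosCount": {"2": {"CD": 1}}}}
    ]
  }
}
""")


def body_token_pos_counts(ef):
    """Sum (token, part-of-speech) counts over page bodies only."""
    totals = Counter()
    for page in ef["features"]["pages"]:
        # A page section may be missing or null; treat it as empty.
        body = page.get("body") or {}
        for token, pos_counts in body.get("tokenPosCount", {}).items():
            for pos, count in pos_counts.items():
                totals[(token, pos)] += count
    return totals


counts = body_token_pos_counts(SAMPLE_EF)
# The noun and verb uses of "rose" are counted separately.
print(counts[("rose", "NN")], counts[("rose", "VBD")])
```

Because the counts are keyed by both token and part-of-speech tag, the two uses of "rose" stay distinct, and the header token "CHAPTER" never enters the totals.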

Examples and tutorials

Tools for working with HTRC Extracted Features

The versions

Version 1.5

NOTE: this dataset has been superseded by Extracted Features versions above.

Documentation

Get the data

Version 0.2

NOTE: this dataset has been superseded by Extracted Features versions above.

Documentation

Get the data


HTRC BookNLP Dataset for English-Language Fiction

Rich, unrestricted entity, word, and character data extracted from ~213,000 volumes of English-language fiction in the HathiTrust Digital Library (HTDL).


The basics

The HTRC BookNLP Dataset for English-Language Fiction (ELF) derived dataset was created using the BookNLP pipeline, extracting data from the NovelTM English-language fiction set, a supervised machine learning-derived set of around 213,000 volumes in the HathiTrust Digital Library. BookNLP is a text analysis pipeline tailored for common natural language processing (NLP) tasks to empower work in computational linguistics, cultural analytics, NLP, machine learning, and other fields.

This dataset was produced with a modified version of the standard BookNLP pipeline that outputs only files that comply with HTRC's non-consumptive use policy. Under that policy, only minimal data that cannot easily be reconstructed into the raw volume may be released. Please see the Files section below for specifics on which files are included and what they contain.

Process

BookNLP is a pipeline that combines state-of-the-art tools for a number of routine cultural analytics or NLP tasks, optimized for large volumes of text, including (verbatim from BookNLP’s GitHub documentation):

  • Part-of-speech tagging
  • Dependency parsing
  • Entity recognition
  • Character name clustering (e.g., "Tom", "Tom Sawyer", "Mr. Sawyer", "Thomas Sawyer" -> TOM_SAWYER) and coreference resolution
  • Quotation speaker identification
  • Supersense tagging (e.g., "animal", "artifact", "body", "cognition", etc.)
  • Event tagging
  • Referential gender inference (TOM_SAWYER -> he/him/his)

This dataset was generated by running each volume in the NovelTM English-Language Fiction dataset, sourced from the HathiTrust Digital Library, through the BookNLP pipeline, generating rich derived data for each volume.

Files

For each book run through the pipeline, this dataset contains the following three files:

  1. The .entities file: contains entities tagged by a state-of-the-art predictive model fine-tuned to identify and extract entities from narrative text. Entity types that are tagged include: people, facilities, geo-political entities, locations, vehicles, and organizations.
  2. The .supersense file: contains tagged tokens and phrases that represent one of 41 lexical categories reflected in the computational linguistics database WordNet. These tags represent fine-grained semantic meaning for each token or phrase within the sentence in which they occur.
  3. The .book file: contains a large JSON object with “characters” (fictional agents in the volumes, e.g. "Ebenezer Scrooge") as the main key, followed by information about each character mentioned more than once in the text. The data includes the following classes of information for each character:
    • All of the names with which the character is associated, including pronouns, to disambiguate mentions in the text ("mentions"). For example, given an excerpt like “I mean that's all I told D.B. about, and he's my brother and all”, the data for the character “D.B.” will include the words explicitly associated with that name in the text as well as the pronoun “he” from this sentence.
    • Words that are used to describe the character ("mod")
    • Nouns the character possesses ("poss")
    • Actions the character does ("agent")
    • Actions done to the character ("patient")
    • An inferred gender label for the character ("g")
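To make the shape of the character data more concrete, here is a minimal sketch that summarizes each character in a .book-style JSON document. The sample record is hypothetical and simplified: only the labels "characters", "mentions", "mod", "poss", "agent", "patient", and "g" come from the description above, while the "id" field and the assumption that each label holds a flat list of strings are ours. The released files may nest this information differently.

```python
import json
from collections import Counter

# Hypothetical, simplified .book-style record. The flat string lists and
# the "id" field are assumptions for illustration, not the released format.
SAMPLE_BOOK = json.loads("""
{
  "characters": [
    {"id": 0,
     "g": "he/him/his",
     "mentions": ["Tom", "Tom Sawyer", "he"],
     "mod": ["young"],
     "poss": ["fence", "aunt"],
     "agent": ["painted", "ran", "ran"],
     "patient": ["scolded"]}
  ]
}
""")


def summarize_characters(book, top_n=2):
    """Return one small summary dict per character in the book."""
    summaries = []
    for character in book["characters"]:
        summaries.append({
            "names": character.get("mentions", []),
            "gender": character.get("g"),
            # Most frequent actions the character performs.
            "top_actions": Counter(character.get("agent", [])).most_common(top_n),
            "possessions": character.get("poss", []),
        })
    return summaries


summaries = summarize_characters(SAMPLE_BOOK)
for summary in summaries:
    print(summary["names"][0], summary["gender"], summary["top_actions"])
```

Using .get with a default keeps the sketch tolerant of characters that lack one of the labels.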


Partner-created derived datasets

HTRC has partnered with researchers to create other derived datasets from the HathiTrust corpus. Follow the links below to learn more and access the data.


NovelTM Datasets for English-Language Fiction, 1700-2009 (Ted Underwood, Patrick Kimutis, Jessica Witte)

Description

This dataset is descriptive metadata for 210,305 volumes of English-language fiction in HathiTrust Digital Library. Nineteenth- and twentieth-century fiction are also divided into seven subsets with different emphases (for instance, one where men and women are represented equally, and one composed of only the most prominent and widely held books). Fiction was identified using a mixed approach of metadata and predictive modeling based on human-assigned ground truth. A full description of the dataset and its creation is available in the dataset report linked below.

Read the report

Get the data from GitHub

Get the data from Zenodo

Word Frequencies in English-Language Literature, 1700-1922 (Ted Underwood)

Description

This dataset contains the word frequencies for all English-language volumes of fiction, drama, and poetry in the HathiTrust Digital Library from 1700 to 1922. Word counts are aggregated at the volume level, but include only pages tagged as belonging to the relevant literary genre. Fiction was identified using a mixed approach of metadata and predictive modeling based on human-assigned ground truth. A full explanation of the dataset's features, motivation, and creation is available on the dataset documentation page below.
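The volume-level aggregation described above can be pictured with a small sketch: page-level word counts are summed into a single count per volume, but only for pages carrying the relevant genre tag. The page records, genre labels, and function name below are hypothetical and are not taken from the dataset's actual pipeline.

```python
from collections import Counter

# Hypothetical page records for one volume: (genre tag, word counts).
PAGES = [
    ("front matter", {"published": 1, "london": 1}),
    ("fiction", {"whale": 3, "sea": 2}),
    ("fiction", {"whale": 1, "ship": 2}),
    ("ads", {"price": 4}),
]


def volume_word_counts(pages, genre):
    """Sum word counts over the pages tagged with the given genre."""
    totals = Counter()
    for page_genre, word_counts in pages:
        if page_genre == genre:
            # Counter.update adds counts rather than replacing them.
            totals.update(word_counts)
    return totals


fiction_counts = volume_word_counts(PAGES, "fiction")
print(dict(fiction_counts))
```

Words that appear only on front matter or advertisement pages, such as "price" here, do not contribute to the volume's totals.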

Documentation

Get the data

Geographic Locations in English-Language Literature, 1701-2011 (Matthew Wilkens)

Description

The dataset contains volume metadata as well as geographic locations, and the number of times each location is mentioned, for works of fiction written in English from 1701 to 2011 that are found in the HathiTrust Digital Library. This dataset relied on Ted Underwood’s NovelTM dataset to determine which volumes to include, and it is part of Matthew Wilkens' larger Textual Geographies Project. Information about the project can be found at the Textual Geographies Project link below. A full explanation of the Textual Geographies in English Literature dataset is available at the documentation link below.

Textual Geographies Project

Documentation

Get the data