Introduction

HTRC Extracted Features


HTRC releases research datasets to facilitate text analysis.

Extracted Features datasets consist of metadata and derived data elements that have been extracted from volumes in the HathiTrust Digital Library.

While copyright-protected texts are not available for download from HathiTrust, fruitful research can still be performed on the basis of non-consumptive analysis of features extracted from full text. These features include volume-level metadata, page-level metadata, part-of-speech-tagged tokens, and token counts. Additionally, HTRC has partnered with advanced researchers to release a derived dataset, Word Frequencies in English-Language Literature, 1700-1922.

Getting Started

Downloading Extracted Features

The dataset is updated periodically, for example to add new volumes or adjust the file schema. Each update is released as a new version. The current version is v.2.0.

Download the data: Downloading Extracted Features
Follow a tutorial: Extracted Features Use Cases and Examples


The basics

A great deal of useful research can be performed with features extracted from full-text volumes. For this reason, we generate and share a dataset called the HTRC Extracted Features. The current version of the dataset is Extracted Features 2.0. Each Extracted Features file corresponds to one volume from the HathiTrust Digital Library. The files are in JSON-LD format.

An Extracted Features file has two main parts:

Metadata

Each file begins with bibliographic and other metadata describing the volume represented by the Extracted Features file. 

Features

Features are notable or informative characteristics of the text. The features include:

  • Token (word) counts, tagged with part of speech to disambiguate homographs (words that share a spelling) and enable a greater variety of analyses
  • Line-level information, such as the number of lines with text on each page and counts of the characters that start and end lines on each page
  • Header and footer identification for cleaner data

Within each Extracted Features file, features are provided per-page to make it possible to separate text from paratext. For instance, feature information could aid in identifying publishers' ads at the back of a book.
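
As a rough illustration of how per-page, part-of-speech-tagged counts might be used, the sketch below sums token counts from the body of each page while ignoring headers and footers. The sample data is invented, and the field names used here ("metadata", "features", "pages", "header", "body", "footer", "tokenPosCount") are assumptions made for the example; consult the schema documentation for the version you download before relying on them.

```python
import json
from collections import Counter

# Hypothetical miniature of an Extracted Features file. All field names and
# values below are assumptions for illustration, not the documented schema.
SAMPLE = """
{
  "metadata": {"title": "Example volume"},
  "features": {
    "pages": [
      {"seq": "00000001",
       "header": {"tokenPosCount": {"CHAPTER": {"NN": 1}}},
       "body": {"tokenPosCount": {"rose": {"NN": 2, "VBD": 1},
                                  "the": {"DT": 3}}},
       "footer": null},
      {"seq": "00000002",
       "header": null,
       "body": {"tokenPosCount": {"rose": {"VBD": 2}}},
       "footer": {"tokenPosCount": {"2": {"CD": 1}}}}
    ]
  }
}
"""


def body_token_counts(volume):
    """Sum (token, part-of-speech) counts over the body section of every page."""
    totals = Counter()
    for page in volume["features"]["pages"]:
        # A section may be missing or null on a page, so fall back to {}.
        section = page.get("body") or {}
        for token, pos_counts in (section.get("tokenPosCount") or {}).items():
            for pos, count in pos_counts.items():
                totals[(token, pos)] += count
    return totals


volume = json.loads(SAMPLE)
counts = body_token_counts(volume)

# The part-of-speech tag keeps "rose" the noun apart from "rose" the verb.
print(counts[("rose", "NN")], counts[("rose", "VBD")])
```

Because headers and footers are stored separately from the body in this sketch, running heads and page numbers never enter the totals. The same loop could be pointed at the "header" or "footer" section to inspect paratext on its own.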

Examples and tutorials

Tools for working with HTRC Extracted Features

Dataset 

The versions

Version 2.0 (current)

Documentation

Basic walk-through of an Extracted Features 2.0 file

Get the data

Version 1.5

NOTE: this dataset has been superseded by Extracted Features versions above.

Documentation

Get the data

Version 0.2

NOTE: this dataset has been superseded by Extracted Features versions above.

Documentation

Get the data


Partner-created derived datasets


HTRC has partnered with researchers to create other derived datasets from the HathiTrust corpus.


Word Frequencies in English-Language Literature, 1700-1922 (Ted Underwood)


Documentation

Get the data