
Introduction

The HTRC Extracted Features (EF) dataset contains informative characteristics, at the page level, of text from public domain volumes in the HathiTrust Digital Library.

HTRC Extracted Features


HTRC Extracted Features datasets consist of metadata and derived data elements that have been extracted from volumes in the HathiTrust Digital Library (HTDL). These are slightly more than 5 million volumes, representing about 38% of the total digital content of the HTDL.

Rationale

Texts from the HTDL corpus that are not in the public domain are not available for download, which limits the usefulness of the corpus for research. However, a great deal of fruitful research, especially in the form of text mining, can be performed on the basis of non-consumptive reading using extracted features (features extracted from the text) even when the full text is not available. To this end, the HathiTrust Research Center (HTRC) has begun making available a set of page-level features extracted from the HTDL's public domain volumes. These extracted features can serve as the basis for certain kinds of algorithmic analysis. For example, because topic modeling algorithms work with "bags of words" (sets of tokens), and because tokens and their frequencies are provided as extracted features, the EF dataset enables a user to perform topic modeling with the data (a sketch of this appears under Extracted Features Data below).

Worksets and the Extracted Features (EF) Dataset

Currently, the extracted features dataset is being provided in connection with worksets. (If you are not familiar with HathiTrust worksets, you may want to review the HTRC Workset Builder tutorial.)

The EF dataset for any HTRC workset can be retrieved as follows:

  1. The user creates a workset (or chooses an existing workset) from the HTRC Portal.
  2. The user runs the EF rsync script generator algorithm (one of the algorithms provided at the HTRC Portal) against that workset. EF datasets are transferred via rsync, a robust file synchronization/transfer utility.
  3. The generator produces a script that the user downloads and executes on his/her own machine. When executed, the script transfers the EF data files for that workset from the HTRC's server to the user's hard disk, resulting in one file per volume.

The EF data is in JSON (JavaScript Object Notation) format, a commonly used lightweight data interchange format.
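For illustration only, a minimal Python sketch of the kind of transfer the generated script performs. The host, module path, and file list below are placeholders, not HTRC's actual endpoints; the real values come from the generated script itself.

import subprocess

# Placeholder endpoint and file list: the script generated by the HTRC
# Portal supplies the real rsync source and the per-workset file paths.
RSYNC_SOURCE = "rsync://ef-host.example.org/features/"
FILE_LIST = "workset-volume-paths.txt"  # one EF file path per line

# -a preserves file attributes, -v lists each transferred file, and
# --files-from restricts the transfer to the workset's EF files.
subprocess.run(
    ["rsync", "-av", "--files-from=" + FILE_LIST, RSYNC_SOURCE, "ef-files/"],
    check=True,
)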

Content of an EF Dataset

An EF data file for a volume consists of volume-level metadata and of the extracted feature data for each page in the volume, all in JSON format. The volume-level metadata consists of both volume metadata (metadata about the volume) and extracted features metadata (metadata about the extracted features); a short sketch of reading these fields follows the two lists below.

Volume Metadata 

  1. schemaVersion: A version identifier for the format and structure of this metadata object.
  2. dateCreated: The time this metadata object was processed.
  3. title: Title of the given volume.
  4. pubDate: The publication year.
  5. language: Primary language of the given volume.
  6. htBibUrl: HT Bibliographic API call for the volume.
  7. handleUrl: The persistent identifier for the volume.
  8. oclc: The array of OCLC number(s).
  9. imprint: The publication place, publisher, and publication date of the given volume.

 

Extracted Features Metadata

  1. schemaVersion: A version identifier for the format and structure of the feature data.
  2. dateCreated: The time the batch of metadata was processed and recorded.
  3. pageCount: The number of pages in the volume.
  4. pages: An array of JSON objects, each representing a page of the volume. 
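To make the two metadata blocks concrete, here is a minimal sketch of reading them with Python's standard json module. The filename is a placeholder, and the field names follow the version 1.x example shown later on this page:

import json

# Placeholder filename: EF files are typically named after the volume ID.
with open("loc.ark+=13960=t1fj34w02.json") as f:
    volume = json.load(f)

meta = volume["metadata"]    # volume metadata (title, pubDate, ...)
feats = volume["features"]   # extracted features metadata plus pages

print(meta["title"], meta["pubDate"], meta["language"])
print("feature schema:", feats["schemaVersion"], "pages:", feats["pageCount"])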

Extracted Features Data

The extracted features that HTRC currently provides include part-of-speech (POS)-tagged token counts, header and footer identification, and various kinds of line-level information. Each page is broken up into three page sections: header, body, and footer.

For each page, the following are provided:

  1. seq: A sequence number (pertaining to the page's position in the volume).
  2. tokenCount: Number of tokens on the page.
  3. lineCount: Number of non-empty lines on the page.
  4. emptyLineCount: Number of empty lines on the page.
  5. sentenceCount: Number of sentences identified on the page (using the open-source OpenNLP software).
  6. languages: List of languages, with their respective percentages, identified on this page.

The corresponding fields for a page section (header, body, or footer) have the same names but apply to that specific page section (see the sketch after this list):

  1. tokenCount: Number of tokens in that page section.
  2. lineCount: Number of lines containing characters of any kind in that page section.
  3. emptyLineCount: Number of lines without text in that page section.
  4. sentenceCount: Number of sentences found in the text in that page section, parsed using OpenNLP.
  5. tokenPosCount: An unordered list of all tokens (characterized by part of speech using OpenNLP), and their corresponding frequency counts, in that page section. 
  6. beginLineChars: Count of the initial character of each line in that page section (ignoring whitespace).
  7. endLineChars: Count of the last character on each line in that page section (ignoring whitespace).
  8. capAlphaSeq: (body only) Maximum length of the alphabetical sequence of capital characters starting a line.
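Continuing the sketch above (same placeholder file and version 1.x field names), a minimal example of walking the per-page and per-section counts:

import json

with open("loc.ark+=13960=t1fj34w02.json") as f:  # placeholder filename
    volume = json.load(f)

for page in volume["features"]["pages"]:
    body = page["body"]
    # Comparing page totals with body totals shows how many tokens the
    # header and footer sections contribute.
    print(page["seq"],
          "page tokens:", page["tokenCount"],
          "body tokens:", body["tokenCount"],
          "body sentences:", body["sentenceCount"])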
 
Notes: 
  1. Correction of hyphenation of tokens at ends of lines has been carried out, but no additional data cleaning or OCR correction has been performed. 
  2. Token information has been provided at the page level because this makes it possible to separate paratext from text — e.g. identify pages of publishers’ ads at the back of a book.
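As noted in the Rationale, these page-level token counts support bag-of-words methods such as topic modeling. Below is a minimal sketch (same placeholder file) that flattens each page's body tokenPosCount into a plain token-frequency bag, summing counts across parts of speech:

import json
from collections import Counter

with open("loc.ark+=13960=t1fj34w02.json") as f:  # placeholder filename
    volume = json.load(f)

bags = []  # one bag of words per page, ready for topic modeling
for page in volume["features"]["pages"]:
    bag = Counter()
    for token, pos_counts in page["body"]["tokenPosCount"].items():
        # Sum counts across POS tags to get a plain term frequency.
        bag[token.lower()] += sum(pos_counts.values())
    bags.append(bag)

print(bags[0].most_common(5))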

Example

Example EF data for basic features for a single page:

{
  "id": "loc.ark:/13960/t1fj34w02",
  "metadata": {
    "schemaVersion": "1.2",
    "dateCreated": "2015-02-12T13:30",
    "title": "Shakespeare's Romeo and Juliet,",
    "pubDate": "1920",
    "language": "eng",
    "htBibUrl": "http://catalog.hathitrust.org/api/volumes/full/htid/loc.ark:/13960/t1fj34w02.json",
    "handleUrl": "http://hdl.handle.net/2027/loc.ark:/13960/t1fj34w02",
    "oclc": "",
    "imprint": "Scott Foresman and company, [c1920]"
  },
  "features": {
    "schemaVersion": "2.0",
    "dateCreated": "2015-02-20T11:31",
    "pageCount": 230,
    "pages": [
      {
        "seq": "00000015",
        "tokenCount": 212,
        "lineCount": 38,
        "emptyLineCount": 10,
        "sentenceCount": 7,
        "languages": [{"en": "1.00"}],
        "header": {
          "tokenCount": 7,
          "lineCount": 3,
          "emptyLineCount": 1,
          "sentenceCount": 1,
          "tokenPosCount": {
            "I.": {"NN": 1}, "THE": {"DT": 1}, "INTRODUCTION": {"NN": 1},
            "DRAMA": {"NNPS": 1}, "SHAKESPEARE": {"NNP": 1},
            "ENGLISH": {"NNP": 1}, "AND": {"CC": 1}
          }
        },
        "body": {
          "tokenCount": 205,
          "lineCount": 35,
          "emptyLineCount": 9,
          "sentenceCount": 6,
          "tokenPosCount": {
            "striking": {"JJ": 1}, "his": {"PRP$": 1}, "plays": {"NNS": 1},
            "London": {"NNP": 1}, "four": {"CD": 1}, ".": {".": 7},
            "dramatic": {"JJ": 2}, "1576": {"CD": 1}, "stands": {"VBZ": 1},
            ...
            "growth": {"NN": 1}
          }
        },
        "footer": {
          "tokenCount": 0,
          "lineCount": 0,
          "emptyLineCount": 0,
          "sentenceCount": 0,
          "tokenPosCount": {}
        }
      }
    ]
  }
}

The dataset is periodically updated, including adding new volumes and adjusting the file schema. When we update the dataset, we create a new version. The current version is v.2.0.

  • Download the data: Downloading Extracted Features
  • Follow a tutorial: Extracted Features Use Cases and Examples


The basics

A great deal of useful research can be performed with features extracted from the full text volumes. For this reason, we generate and share a dataset called the HTRC Extracted Features. The current version of the dataset is Extracted Features 2.0. Each Extracted Features file that is generated corresponds to a volume from the HathiTrust Digital Library. The files are in JSON-LD format.

An Extracted Features file has two main parts:

Metadata

Each file begins with bibliographic and other metadata describing the volume represented by the Extracted Features file. 

Features

Features are notable or informative characteristics of the text. The features include:

  • Token (word) counts that have been tagged with part of speech in order to disambiguate homographs and enable a greater variety of analyses
  • Various line-level information, such as the number of lines with text on each page, and a count of characters that start and end lines on each page
  • Header and footer identification for cleaner data. 

Within each Extracted Features file, features are provided per-page to make it possible to separate text from paratext. For instance, feature information could aid in identifying publishers' ads at the back of a book.
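One commonly used tool for this format is HTRC's htrc-feature-reader Python library. A minimal sketch, assuming the library is installed (pip install htrc-feature-reader), its current API, and a downloaded file at a placeholder path:

from htrc_features import Volume  # pip install htrc-feature-reader

# Placeholder path to a downloaded Extracted Features file.
vol = Volume("loc.ark+=13960=t1fj34w02.json")

print(vol.title, vol.year)
# tokenlist() returns a pandas DataFrame of per-page, per-section,
# POS-tagged token counts.
tokens = vol.tokenlist()
print(tokens.head())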

Examples and tutorials

Tools for working with HTRC Extracted Features

The versions

Version 1.5

NOTE: this dataset has been superseded by Extracted Features versions above.

Documentation

Get the data

Version 0.2

NOTE: this dataset has been superseded by Extracted Features versions above.

Documentation

Get the data


Partner-created derived datasets



HTRC has partnered with researchers to create other derived datasets from the HathiTrust corpus. Follow the links below to learn more and access the data.


NovelTM Datasets for English-Language Fiction, 1700-2009 (Ted Underwood, Patrick Kimutis, Jessica Witte)

Description

This dataset is descriptive metadata for 210,305 volumes of English-language fiction in the HathiTrust Digital Library. Nineteenth- and twentieth-century fiction are also divided into seven subsets with different emphases (for instance, one where men and women are represented equally, and one composed of only the most prominent and widely-held books). Fiction was identified using a mixed approach of metadata and predictive modeling based on human-assigned ground truth. A full description of the dataset and its creation is available in the dataset report linked below.

Read the report

Get the data from GitHub

Get the data from Zenodo

Word Frequencies in English-Language Literature, 1700-1922 (Ted Underwood)

Description

This dataset contains the word frequencies for all English-language volumes of fiction, drama, and poetry in the HathiTrust Digital Library from 1700 to 1922. Word counts are aggregated at the volume level, but include only pages tagged as belonging to the relevant literary genre. Fiction was identified using a mixed approach of metadata and predictive modeling based on human-assigned ground truth. A full explanation of the dataset's features, motivation, and creation is available on the dataset documentation page below.

Documentation

Get the data

Geographic Locations in English-Language Literature, 1701-2011 (Matthew Wilkens)

Description

The dataset contains volume metadata as well as geographic locations and the number of times each location is mentioned in the text of works of fiction written in English from 1701 to 2011 that are found in the HathiTrust Digital Library. This dataset relied on Ted Underwood's NovelTM dataset to determine which volumes to include, and it is part of Matthew Wilkens' larger Textual Geographies Project. Information about the project can be found at the Textual Geographies Project link below. A full explanation of the Textual Geographies in English Literature dataset is available at the documentation link below.

Textual Geographies Project

Documentation

Get the data