Introduction
The Extracted Features (EF) dataset contains informative characteristics of the text, extracted at the page level, from public domain volumes in the HathiTrust Digital Library (HTDL). These comprise slightly more than 5 million volumes, representing about 38% of the HTDL's total digital content.
Rationale
Texts from the HTDL corpus that are not in the public domain are not available for download, which limits the corpus's usefulness for research. However, a great deal of fruitful research, especially in the form of text mining, can be performed on the basis of non-consumptive reading using extracted features (features extracted from the text) even when the full text is not available. To this end, the HathiTrust Research Center (HTRC) has started making available a set of page-level features extracted from the HTDL's public domain volumes. These extracted features can serve as the basis for certain kinds of algorithmic analysis. For example, topic modeling works with bags of words (sets of tokens); since tokens and their frequencies are provided as features, the EF dataset enables a user to perform topic modeling with the data.
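As an illustration (not HTRC-provided code), the following Python sketch builds one bag of words per page from the tokenPosCount fields described later in this section and fits a topic model with scikit-learn; the local filename and the choice of library are hypothetical:

# Sketch only: build page-level bags of words from an EF file's tokenPosCount
# fields and fit a topic model. The local filename below is a placeholder.
import json
from sklearn.feature_extraction import DictVectorizer
from sklearn.decomposition import LatentDirichletAllocation

with open("volume.basic.json", encoding="utf-8") as f:   # hypothetical decompressed EF file
    pages = json.load(f)["features"]["pages"]

def bag_of_words(page):
    # Collapse POS-tagged counts into a plain token -> count mapping for the page body.
    bow = {}
    for token, pos_counts in page["body"]["tokenPosCount"].items():
        bow[token.lower()] = bow.get(token.lower(), 0) + sum(pos_counts.values())
    return bow

X = DictVectorizer().fit_transform([bag_of_words(p) for p in pages])  # page-by-token count matrix
lda = LatentDirichletAllocation(n_components=5, random_state=0).fit(X)  # topics over the pages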
Worksets and the Extracted Features (EF) Dataset
Currently, the extracted features dataset is provided in connection with worksets. (If you are not familiar with HathiTrust worksets, you may want to review the tutorial on the HTRC Workset Builder available elsewhere in this Wiki.)
The EF dataset for any HTRC workset can be retrieved as follows. A user first creates a workset (or chooses an existing one) from the HTRC Portal. The EF data for the workset is transferred via rsync, a robust file synchronization/transfer utility. The user runs the EF rsync script generator algorithm (one of the algorithms provided at the HTRC Portal) against that workset. This produces a script that the user can download and execute on their own machine. When executed, the script transfers the EF data files for that workset from the HTRC's server to the user's hard disk, producing, for each volume in the selected workset, two zipped files containing the “basic” and “advanced” EF data. The EF data is in JSON (JavaScript Object Notation) format, a commonly used lightweight data interchange format.
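Conceptually, the generated script issues one rsync call per volume in the workset. The sketch below illustrates that shape only; the server address and remote paths are placeholders, not the actual HTRC endpoints, which come from the generated script:

# Illustration only: the real server address and remote file paths are supplied
# by the script generated at the HTRC Portal; the values below are placeholders.
import os
import subprocess

remote_files = [
    "rsync://data.example.org/ef/volume1-basic.json",     # placeholder path
    "rsync://data.example.org/ef/volume1-advanced.json",  # placeholder path
]

os.makedirs("ef-data", exist_ok=True)
for remote in remote_files:
    subprocess.run(["rsync", "-av", remote, "ef-data/"], check=True)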
Content of an EF Dataset
An EF data file for a volume consists of volume-level metadata and the extracted feature data for each page in the volume, all in JSON format. The volume-level metadata consists of both metadata about the volume and metadata about the extracted features.
Metadata about the volume consists of the following pieces of data:
Metadata about the extracted features consists of the following pieces of data:
“Extracted features” data:
The extracted features that HTRC currently provides include part-of-speech (POS)-tagged token counts, header and footer identification, and various kinds of line-level information. (Providing token information at the page level makes it possible to separate paratext from text — e.g. to identify pages of publishers' ads at the back of a book.) Each page is broken up into three parts: header, body, and footer. Hyphenation of tokens at the ends of lines has been corrected, but no additional data cleaning or OCR correction has been carried out.
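As a hedged sketch of how this per-page structure can be read, the helper below (written against the field names in the sample record at the end of this section) collapses each section's POS-tagged counts into plain token counts:

# Sketch only: summarize one entry of an EF file's "pages" list, using the
# header/body/footer split and the tokenPosCount structure shown in the sample below.
def summarize_page(page):
    summary = {"seq": page["seq"], "tokenCount": page["tokenCount"]}
    for section in ("header", "body", "footer"):
        sec = page[section]
        # Collapse {"token": {"POS": count, ...}} into {"token": count}.
        tokens = {tok: sum(pos.values()) for tok, pos in sec["tokenPosCount"].items()}
        summary[section] = {
            "tokenCount": sec["tokenCount"],
            "lineCount": sec["lineCount"],
            "sentenceCount": sec["sentenceCount"],
            "distinctTokens": len(tokens),
        }
    return summary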
“Basic” features
The corresponding fields for header, body, and footer are the same, but apply to different parts of the page.
“Advanced” features
This information is provided to help clarify genre and volume structure; for instance, it can help distinguish poetry from prose, or body text from an index.
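The advanced feature fields themselves are not shown in the sample record below; as a rough, purely illustrative sketch of this kind of structural inference, the function below uses only the basic line and sentence counts to flag verse-like pages (many non-empty lines, few sentence boundaries). The thresholds are arbitrary and would need tuning against real volumes:

# Illustration only: a crude proxy for "verse vs. prose" using basic counts.
# `page` is one entry of an EF file's "pages" list; thresholds are arbitrary.
def looks_like_verse(page, min_lines=10, max_sentences_per_line=0.2):
    body = page["body"]
    lines = body["lineCount"] - body["emptyLineCount"]
    if lines < min_lines:
        return False
    return body["sentenceCount"] / lines <= max_sentences_per_line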
A simplified EF data file for basic features, with metadata and features for a single page:
{ "id":"loc.ark:/13960/t1fj34w02",
"metadata":{
"schemaVersion":"1.2",
"dateCreated":"2015-02-12T13:30",
"title":"Shakespeare's Romeo and Juliet,",
"pubDate":"1920",
"language":"eng",
"htBibUrl":"http://catalog.hathitrust.org/api/volumes/full/htid/loc.ark:/13960/t1fj34w02.json",
"handleUrl":"http://hdl.handle.net/2027/loc.ark:/13960/t1fj34w02",
"oclc":"",
"imprint":"Scott Foresman and company, [c1920]"
},
"features":{
"schemaVersion":"2.0",
"dateCreated":"2015-02-20T11:31",
"pageCount":230,
"pages":[
{"seq":"00000015",
"tokenCount":212,
"lineCount":38,
"emptyLineCount":10,
"sentenceCount":7,
"languages":[{"en":"1.00"}],
"header":{
"tokenCount":7,
"lineCount":3,
"emptyLineCount":1,
"sentenceCount":1,
"tokenPosCount":{
"I.":{"NN":1},
"THE":{"DT":1},
"INTRODUCTION":{"NN":1},
"DRAMA":{"NNPS":1},
"SHAKESPEARE":{"NNP":1},
"ENGLISH":{"NNP":1},
"AND":{"CC":1}}},
"body":{
"tokenCount":205,
"lineCount":35,
"emptyLineCount":9,
"sentenceCount":6,
"tokenPosCount":{
"striking":{"JJ":1},
"his":{"PRP$":1},
"plays":{"NNS":1},
"London":{"NNP":1},
"four":{"CD":1},
".":{".":7},
"dramatic":{"JJ":2},
"1576":{"CD":1},
"stands":{"VBZ":1},
...
"growth":{"NN":1}
}
},
"footer":{
"tokenCount":0,
"lineCount":0,
"emptyLineCount":0,
"sentenceCount":0,
"tokenPosCount":{}}}]}}