Get the Data
The data is accessible using rsync. Rsync should be installed already on your Mac or Linux system; Windows users can use it through Cygwin.
A sample of 100 extracted feature files is available for download through your browser: sample-beta.zip.
Rsync will download each feature file individually, following a pairtree directory structure.
To sync *all* the feature files:
rsync -av sandbox.htrc.illinois.edu::pd-features/ .
Note that this dataset is 1.2 terabytes! Only download all of it if you know what you are doing.
A randomly sorted listing of all the files is available in the following location:
rsync -azv sandbox.htrc.illinois.edu::pd-features/listing/pd-file-listing.txt .
Users hoping for more flexible file listings can use rsync's --list-only flag.
To rsync only the files in a given text file:
rsync -av --files-from FILE.TXT sandbox.htrc.illinois.edu::pd-features/ .
Volume feature files are named after the volume's ID, with the following characters substituted: ':' becomes '+' and '/' becomes '='. This means that any list of HathiTrust public domain volume IDs can be used to download the corresponding feature files.
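The substitution above can be sketched in a few lines of Python (the example volume ID and the .json.bz2 extension are illustrative assumptions, not values taken from this documentation):

```python
def id_to_filename(volume_id):
    """Map a HathiTrust volume ID to its feature-file name."""
    # Apply the documented substitutions: ':' -> '+' and '/' -> '='.
    clean = volume_id.replace(":", "+").replace("/", "=")
    # The '.json.bz2' extension is an assumption about the file packaging.
    return clean + ".json.bz2"

# Hypothetical ID, for illustration only:
print(id_to_filename("uc2.ark:/13960/t4mk66f1d"))
```

A list of IDs transformed this way can be written to a text file and passed to rsync's --files-from flag, as shown above.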
To download the Extracted Features data for a specific workset in the HTRC Portal, there is an algorithm that generates the Rsync download script: EF Rsync Script Generator.
Computation was supported by the Cline Center for Democracy through its allocation on the Blue Waters sustained-petascale computing project, which is funded by the National Science Foundation (awards OCI-0725070 and ACI-1238993) and the state of Illinois. Blue Waters is a joint effort of the University of Illinois at Urbana-Champaign and its National Center for Supercomputing Applications.
This release has been made possible, in part, by the National Endowment for the Humanities: Celebrating 50 Years of Excellence. Any views, findings, conclusions, or recommendations expressed in this release do not necessarily represent those of the National Endowment for the Humanities.
How are tokens parsed?
Hyphenation of tokens at end of line was corrected using custom code. Apache OpenNLP was used for sentence segmentation, tokenization, and part of speech (POS) tagging. No additional data cleaning or OCR correction was performed.
OpenNLP uses the Penn Treebank POS tags.
Can I use the page sequence as a unique identifier?
The seq value is always sequential from the start of the volume. Each scanned page has a unique sequence number, but it is specific to the current version of the full text: in theory, updates to the OCR that add or remove pages will change the sequence. The practical likelihood of such changes is low, but treat the page sequence as an identifier with caution.
A future release of this data will include persistent page identifiers that remain unchanged even when page sequence changes.
Where is the bibliographic metadata? Who wrote the book? When was it published?
This dataset is foremost an extracted features dataset, with minimal metadata included as a convenience. For additional metadata, e.g. subject classifications, HathiTrust offers the Hathifiles, which can be paired with our feature dataset through the volume id field.
The metadata included in this dataset combines MARC metadata from HathiTrust with additional information from the Hathifiles:
- imprint: 260a from HathiTrust MARC record, 260b and 260c from Hathifiles.
- language: MARC control field 008 from Hathifiles.
- pubDate: extracted from Hathifiles. See also: details on HathiTrust's rights-determination.
- oclc: extracted from Hathifiles.
Additionally, schemaVersion and dateCreated are specific to this feature dataset.
What do I do with beginning- or end-of-line characters?
The characters at the start and end of a line can be used to differentiate text from paratext at a page level. For instance, index lines tend to begin with capitalized letters and end with numbers. Likewise, lines in a table of contents can be identified through Arabic or Roman numerals at the start of a line.
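A minimal sketch of such a heuristic, assuming each page section exposes an endCharCount map of line-final character to count (the field name and threshold are assumptions for illustration):

```python
def looks_like_index(section):
    """Flag a page section whose lines mostly end in digits,
    a pattern typical of index pages."""
    # Assumed field: endCharCount maps each line-final character to a count.
    end_counts = section.get("endCharCount", {})
    total = sum(end_counts.values())
    if total == 0:
        return False
    digit_lines = sum(n for ch, n in end_counts.items() if ch.isdigit())
    # Threshold of 0.5 is an arbitrary illustrative choice.
    return digit_lines / total > 0.5

# Hypothetical section data: 18 of 20 lines end in a digit.
sample = {"endCharCount": {"3": 10, "7": 8, ".": 2}}
print(looks_like_index(sample))
```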
What is the difference between the header, body, and footer sections?
Because repeated headers and footers can distort word counts in a document, but also help identify document parts, we attempt to identify repeated lines at the top or bottom of a page and provide separate token counts for those forms of paratext. The "header" and "footer" sections will also include tokens that are page numbers, catchwords, or other short lines at the very top or bottom of a page. Users can of course ignore these divisions by aggregating the token counts for header, body, and footer sections.
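Aggregating across the three sections can be sketched as follows, assuming each section carries a tokenPosCount map of token to per-POS counts (the field names are assumptions about the feature-file schema):

```python
from collections import Counter

def page_token_counts(page):
    """Sum token counts across the header, body, and footer sections,
    ignoring the paratext divisions."""
    totals = Counter()
    for name in ("header", "body", "footer"):
        # Assumed field: tokenPosCount maps token -> {POS tag -> count}.
        tpc = (page.get(name) or {}).get("tokenPosCount", {})
        for token, pos_counts in tpc.items():
            totals[token] += sum(pos_counts.values())
    return totals

# Hypothetical page data, for illustration only:
page = {
    "header": {"tokenPosCount": {"CHAPTER": {"NN": 1}}},
    "body":   {"tokenPosCount": {"the": {"DT": 4}, "CHAPTER": {"NN": 1}}},
    "footer": {"tokenPosCount": {"12": {"CD": 1}}},
}
print(page_token_counts(page))
```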
If you've built tools or scripts for processing our data, let us know and we'll feature them here!