An Extracted Features (EF) data file for a volume is a JSON document that contains volume-level metadata and the extracted feature data for each page in the volume. The volume-level metadata covers both the volume itself and the extracted features.

Metadata about the volume consists of the following pieces of data:

1.schemaVersion: A version identifier for the format and structure of this metadata object.
2.dateCreated: The time this metadata object was processed.
3.title: Title of the given volume.
4.pubDate: The publication year.
5.language: Primary language of the given volume.
6.htBibUrl: HT Bibliographic API call for the volume.
7.handleUrl: The persistent identifier for the volume.
8.oclc: The array of OCLC number(s).
9.imprint: The publication place, publisher, and publication date of the given volume.
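
As a rough sketch of how the fields listed above might be read in Python, the snippet below parses a small hand-made metadata object. Only the title and imprint values come from the simplified example in this document; every other value is a placeholder, the flat layout of the object is an assumption, and `summarize_volume` is a hypothetical helper rather than part of any HTRC tool.

```python
import json

# Hypothetical metadata object. Only "title" and "imprint" are taken from the
# simplified example in this document; the other values are placeholders.
SAMPLE = """
{
  "schemaVersion": "placeholder-version",
  "title": "Shakespeare's Romeo and Juliet,",
  "pubDate": "1920",
  "language": "eng",
  "oclc": ["0000000"],
  "imprint": "Scott Foresman and company, [c1920]"
}
"""


def summarize_volume(metadata):
    """Return a one-line description built from volume-level metadata."""
    # Catalog titles often end with stray punctuation, so trim it.
    title = metadata.get("title", "(untitled)").rstrip(", ")
    year = metadata.get("pubDate", "n.d.")
    language = metadata.get("language", "und")
    return f"{title} ({year}, {language})"


metadata = json.loads(SAMPLE)
summary = summarize_volume(metadata)
print(summary)
```

Using `dict.get` with defaults keeps the helper tolerant of volumes where a field is absent, which seems a reasonable precaution for catalog-derived metadata.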

Metadata about the extracted features consists of the following pieces of data:


1.schemaVersion: A version identifier for the format and structure of the feature data.
2.dateCreated: The time the batch of metadata was processed and recorded.
3.pageCount: The number of pages in the volume.
4.pages: An array of JSON objects, each representing a page of the volume. 
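
Because `pageCount` and `pages` describe the same thing in two ways, a simple sanity check is possible after loading a file. The sketch below assumes the four fields sit together in one object; the sample values and the `check_page_count` helper are hypothetical.

```python
def check_page_count(features):
    """Return True when pageCount matches the length of the pages array."""
    return features["pageCount"] == len(features["pages"])


# Hypothetical features object with two stub pages (placeholder values).
features = {
    "schemaVersion": "placeholder-version",
    "dateCreated": "placeholder-date",
    "pageCount": 2,
    "pages": [{"seq": "00000001"}, {"seq": "00000002"}],
}

ok = check_page_count(features)
print(ok)
```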


“Extracted features” data:




The extracted features that HTRC currently provides include part-of-speech (POS) tagged token counts, header and footer identification, and various line-level information. Providing token information at the page level makes it possible to separate paratext from text, for example to identify pages of publishers’ ads at the back of a book. Each page is broken up into three parts: header, body, and footer. Tokens hyphenated across line ends have been rejoined, but no additional data cleaning or OCR correction has been performed.

“Basic” features

1.seq: A sequence number indicating the page’s position in the volume.
2.tokenCount: Number of tokens in the page.
3.lineCount: Number of non-empty lines in the page.
4.emptyLineCount: Number of empty lines in the page.
5.sentenceCount: Number of sentences identified on the page (using the open-source OpenNLP software).
6.languages: Languages identified on this page, each with its respective percentage.

The header, body, and footer sections each carry the same set of fields, computed over that section of the page only:


1.tokenCount: Number of tokens in this page section.
2.lineCount: Number of lines containing characters of any kind in this page section.
3.emptyLineCount: Number of lines without text in this page section.
4.sentenceCount: Number of sentences found in the text in this page section, parsed using OpenNLP.
5.tokenPosCount: An unordered list of all tokens (characterized by part of speech using OpenNLP), and their corresponding frequency counts, in this page section. 
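
To illustrate how the per-section token counts could be combined, here is a minimal sketch. It assumes that `tokenPosCount` is represented as a mapping from each token to a mapping of POS tag to frequency, and that a page object holds its sections under the keys `header`, `body`, and `footer`; both are assumptions, and the sample page and the `token_counts` helper are hypothetical.

```python
from collections import Counter


def token_counts(page, sections=("header", "body", "footer")):
    """Sum token frequencies over the chosen page sections, ignoring POS tags."""
    totals = Counter()
    for name in sections:
        section = page.get(name, {})
        for token, pos_counts in section.get("tokenPosCount", {}).items():
            # Collapse the per-POS counts into one frequency per token.
            totals[token] += sum(pos_counts.values())
    return totals


# Hypothetical page; the token -> {POS: count} layout is an assumption.
page = {
    "seq": "00000001",
    "header": {"tokenCount": 2,
               "tokenPosCount": {"ROMEO": {"NNP": 1}, "12": {"CD": 1}}},
    "body": {"tokenCount": 4,
             "tokenPosCount": {"love": {"NN": 2, "VB": 1}, "ROMEO": {"NNP": 1}}},
    "footer": {"tokenCount": 0, "tokenPosCount": {}},
}

counts = token_counts(page)
body_only = token_counts(page, sections=("body",))
print(counts.most_common(2))
```

Restricting `sections` to the body is one way to leave running headers and page numbers out of a word-frequency analysis.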


“Advanced” features

This information is provided to help clarify genre and volume structure; for instance, it can help distinguish poetry from prose, or body text from an index.


1.seq: A sequence number (same as in “basic” features)
2.beginLineChars: Counts of the initial characters of the lines in this page section (ignoring whitespace).
3.endLineChars: Counts of the final characters of the lines in this page section (ignoring whitespace).
4.capAlphaSeq: (body only) Maximum length of the alphabetical sequence of capital characters starting a line.
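
One way the line-level fields might help separate poetry from prose is to look at how many lines open with a capital letter, since verse lines conventionally do. The sketch below is a heuristic illustration only: it assumes `beginLineChars` maps each initial character to the number of lines starting with it, and the sample values and the `capitalized_line_share` helper are hypothetical.

```python
def capitalized_line_share(section):
    """Fraction of lines in a page section that begin with an uppercase letter."""
    begin = section.get("beginLineChars", {})
    total = sum(begin.values())
    if total == 0:
        return 0.0  # no non-empty lines in this section
    upper = sum(n for ch, n in begin.items() if ch.isupper())
    return upper / total


# Hypothetical sections; the character -> count layout is an assumption.
verse_like = {"beginLineChars": {"A": 3, "T": 4, "W": 2, "a": 1}}
prose_like = {"beginLineChars": {"t": 5, "a": 3, "T": 1, "\"": 1}}

verse_share = capitalized_line_share(verse_like)
prose_share = capitalized_line_share(prose_like)
print(verse_share, prose_share)
```

A high share suggests verse and a low share suggests running prose, but any cut-off would need to be tuned on real data; none is given in this document.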


A simplified EF data file for basic features, with metadata and features for a single page:


{  "id":"loc.ark:/13960/t1fj34w02",




      "title":"Shakespeare's Romeo and Juliet,",






      "imprint":"Scott Foresman and company, [c1920]"