
Metadata about the extracted features consists of the following pieces of data:


1. schemaVersion: A version identifier for the format and structure of the feature data.
2. dateCreated: The time the batch of metadata was processed and recorded.
3. pageCount: The number of pages in the volume.
4. pages: An array of JSON objects, each representing a page of the volume.
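
As a minimal sketch of how these four fields might be read, the Python below parses a small hand-written JSON object and checks that pageCount agrees with the length of the pages array. The sample values, the `summarize_features` helper, and the assumption that the four fields sit together in a single JSON object are illustrative only; none of them is taken from an actual EF file.

```python
import json

# Hypothetical sample: the field names come from the list above,
# but every value here is invented for illustration.
SAMPLE = """
{
  "schemaVersion": "x.y",
  "dateCreated": "2015-01-01T00:00",
  "pageCount": 2,
  "pages": [{"seq": "00000001"}, {"seq": "00000002"}]
}
"""

def summarize_features(features):
    """Return the top-level fields plus a pageCount consistency flag."""
    pages = features["pages"]
    return {
        "schemaVersion": features["schemaVersion"],
        "dateCreated": features["dateCreated"],
        "pageCount": features["pageCount"],
        # True when the declared count matches the number of page objects.
        "pagesMatchCount": len(pages) == features["pageCount"],
    }

summary = summarize_features(json.loads(SAMPLE))
print(summary)
```

A check like `pagesMatchCount` is a cheap way to catch truncated or partially written files before any per-page processing starts.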


“Extracted features” data:




The extracted features that HTRC currently provides include part-of-speech (POS) tagged token counts, header and footer identification, and various line-level information. (Providing token information at the page level makes it possible to separate paratext from text, for example to identify pages of publishers’ ads at the back of a book.) Each page is broken up into three parts: header, body, and footer. Tokens hyphenated across line ends have been rejoined, but no additional data cleaning or OCR correction has been carried out.

“Basic” features


1. seq: A sequence number (pertaining to the page’s position).


5. tokenPosCount: An unordered list of all tokens in this page section, each characterized by part of speech (using OpenNLP), together with their corresponding frequency counts.
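
To illustrate how tokenPosCount might be used, the sketch below collapses the part-of-speech tags into a single count per token and sums those counts over selected page sections. It assumes, hypothetically, that each section's tokenPosCount is a mapping from token to a mapping of POS tag to count; the sample page, the tag names, and the helper functions `section_token_totals` and `page_token_totals` are invented for illustration.

```python
from collections import Counter

# Hypothetical page: assumes tokenPosCount maps token -> {POS tag: count}.
PAGE = {
    "seq": "00000001",
    "header": {"tokenPosCount": {"CHAPTER": {"NN": 1}, "I": {"CD": 1}}},
    "body": {"tokenPosCount": {"the": {"DT": 3}, "run": {"NN": 1, "VB": 2}}},
    "footer": {"tokenPosCount": {"12": {"CD": 1}}},
}

def section_token_totals(section):
    """Collapse POS tags: total count per token within one page section."""
    totals = Counter()
    for token, pos_counts in section.get("tokenPosCount", {}).items():
        totals[token] += sum(pos_counts.values())
    return totals

def page_token_totals(page, sections=("header", "body", "footer")):
    """Sum per-token totals over the chosen sections of a page."""
    totals = Counter()
    for name in sections:
        totals.update(section_token_totals(page.get(name, {})))
    return totals

# Restricting to the body leaves out running heads and page numbers.
body_only = page_token_totals(PAGE, sections=("body",))
whole_page = page_token_totals(PAGE)
print(body_only["run"], sum(whole_page.values()))
```

Keeping the section list as a parameter makes it easy to compare body-only counts with whole-page counts when deciding whether headers and footers matter for a given analysis.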

“Advanced” features

This information is provided to help clarify genre and volume structure; for instance, it can help distinguish poetry from prose, or body text from an index.


4. capAlphaSeq: (body only) Maximum length of the alphabetical sequence of capital characters starting a line.
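
One possible reading of capAlphaSeq can be sketched as follows: look at the first character of each body line and find the longest run of consecutive lines that start with capital letters in non-decreasing alphabetical order. This is an interpretation of the definition above, not HTRC's actual implementation; the function name `cap_alpha_seq`, the restriction to ASCII capitals, and the sample lines are assumptions made for illustration.

```python
def cap_alpha_seq(lines):
    """Longest run of consecutive lines whose first characters are
    ASCII capital letters in non-decreasing alphabetical order.

    One reading of the capAlphaSeq definition; the real feature
    may be computed differently.
    """
    best = current = 0
    prev = None
    for line in lines:
        text = line.strip()
        first = text[0] if text else ""
        if "A" <= first <= "Z":
            # Extend the run if order is kept, otherwise restart at this line.
            current = current + 1 if prev is not None and first >= prev else 1
            prev = first
        else:
            # A blank or non-capital line breaks the run.
            current, prev = 0, None
        best = max(best, current)
    return best

# Invented sample lines: an index-like page and a prose-like page.
index_like = ["Abbey, 12", "Adams, 40", "Baker, 7", "Cole, 19", "see also p. 3"]
prose_like = ["The day was", "long and the", "Road was dry.", "And so on."]
print(cap_alpha_seq(index_like), cap_alpha_seq(prose_like))
```

Under this reading, an index page yields a long run while ordinary prose yields a short one, which matches the stated purpose of distinguishing body text from an index.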


A simplified EF data file for basic features, with metadata and features for a single page: