The EF dataset for any HTRC workset can be retrieved as follows. A user first creates a workset (or chooses an existing workset) from the HTRC Portal. The user then executes the EF rsync script generator algorithm (available as one of the algorithms provided at the HTRC Portal) with that workset. This produces a script that the user can download and execute on their own machine. When executed, the script uses rsync, a robust file synchronization/transfer utility, to transfer the EF data files for that workset from the HTRC’s server to the user’s hard disk. The result, for each volume in the selected workset, is two zipped files containing “basic” and “advanced” EF data. The EF data is in JSON (JavaScript Object Notation) format, a commonly used lightweight data interchange format.
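Once the files are on disk, they can be read with any JSON library. The following is a minimal Python sketch, not an official HTRC tool. It assumes the downloaded files are bzip2-compressed JSON; the text above only says "zipped", so the compression format, the helper name `load_ef_file`, and the file name in the usage note are all assumptions to adjust for the files you actually receive.

```python
import bz2
import json
from pathlib import Path


def load_ef_file(path):
    """Load one EF data file and return the parsed JSON object.

    Assumption: the file is bzip2-compressed JSON. If your files use a
    different compression format, swap bz2.open for the matching opener.
    """
    with bz2.open(Path(path), mode="rt", encoding="utf-8") as handle:
        return json.load(handle)
```

A call such as `load_ef_file("example-volume.basic.json.bz2")` (a hypothetical file name) would return the parsed contents of that file as ordinary Python dictionaries and lists.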


An EF data file for a volume consists of volume-level metadata and the extracted feature data for each page in the volume, in JSON format. The volume-level metadata consists of both volume metadata (metadata about the volume) and extracted features metadata (metadata about the extracted features).

Volume Metadata

  1. schemaVersion: A version identifier for the format and structure of this metadata object.
  2. dateCreated: The time this metadata object was processed.
  3. title: Title of the given volume.
  4. pubDate: The publication year.
  5. language: Primary language of the given volume.
  6. htBibUrl: HT Bibliographic API call for the volume.
  7. handleUrl: The persistent identifier for the volume.
  8. oclc: The array of OCLC number(s).
  9. imprint: The publication place, publisher, and publication date of the given volume.
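As an illustration of how these fields might be read, here is a small Python sketch. The field names come from the list above; everything else is an assumption made for the example: the sample values are placeholders rather than real HTRC data, the nesting of the fields under a top-level `metadata` key is a guess at the file layout, and `summarize_volume_metadata` is a hypothetical helper.

```python
import json

# Hypothetical sample record. All values are placeholders, and the
# top-level "metadata" key is an assumption about the file layout.
SAMPLE = """
{
  "metadata": {
    "schemaVersion": "1.0",
    "dateCreated": "2015-01-01T00:00",
    "title": "An example title",
    "pubDate": "1900",
    "language": "eng",
    "htBibUrl": "http://example.org/bib",
    "handleUrl": "http://example.org/handle",
    "oclc": ["12345"],
    "imprint": "Example City, Example Press, 1900"
  }
}
"""

# Field names as listed in this section.
FIELDS = [
    "schemaVersion", "dateCreated", "title", "pubDate", "language",
    "htBibUrl", "handleUrl", "oclc", "imprint",
]


def summarize_volume_metadata(ef, fields=FIELDS):
    """Return the listed volume metadata fields; missing ones map to None."""
    meta = ef.get("metadata", {})
    return {name: meta.get(name) for name in fields}


summary = summarize_volume_metadata(json.loads(SAMPLE))
for name, value in summary.items():
    print(f"{name}: {value}")
```

Using `dict.get` keeps the sketch tolerant of records where a field is absent, which seems a reasonable default when the exact schema version of a file is not known in advance.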