...

“LD” stands for linked data, which means that data in the files can be interlinked with other data sources via Uniform Resource Identifiers (URIs), often in the form of Uniform Resource Locators (URLs), a kind of web link that you will find at various points in the file. Certain terms in the file are presented with a URI that links the entity to an ontology or vocabulary like Schema.org, or to authority databases such as the Virtual International Authority File (VIAF) or Library of Congress linked data resources. When no entry was found in a third-party database, an entity was created for the name in a named entity database maintained by HTRC.

...

If you expand the “context” data (click the arrow beside the term), the full URI is revealed:


You won’t need to know what every single term means in order to use these files, but familiarizing yourself with the various objects within a file is helpful in understanding what “extracted features” are and how you might want to use them in your research.

...

Some of these terms look similar to those listed at the top of the file, such as “schemaVersion”, “id”, “type”, and “dateCreated”. Some of these terms are still associated with this specific file ("schemaVersion" and "dateCreated"), but others now refer to the volume itself ("id" and "type"). (Again, refer to this doc here if you want to know the full extent of these terms’ meanings and what their values indicate.)
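To see the repeated terms side by side, a minimal Python sketch is given here. It assumes the file is JSON with a top-level "metadata" object; the sample is a hypothetical, heavily trimmed stand-in, and every value in it is a placeholder rather than real EF data. The helper name `compare_levels` is also illustrative.

```python
import json

# Hypothetical, trimmed stand-in for an EF file. All values are placeholders.
sample = json.loads("""
{
  "schemaVersion": "placeholder-file-schema",
  "id": "placeholder-file-id",
  "type": "placeholder-file-type",
  "dateCreated": "placeholder-file-date",
  "metadata": {
    "schemaVersion": "placeholder-metadata-schema",
    "id": "placeholder-volume-id",
    "type": "placeholder-volume-type",
    "dateCreated": "placeholder-metadata-date"
  }
}
""")

# The four terms the text says appear at both levels.
SHARED_KEYS = ["schemaVersion", "id", "type", "dateCreated"]


def compare_levels(ef):
    """Return {key: (top-level value, value inside "metadata")}."""
    metadata = ef.get("metadata") or {}
    return {key: (ef.get(key), metadata.get(key)) for key in SHARED_KEYS}


comparison = compare_levels(sample)
for key, (top, meta) in comparison.items():
    print(f"{key}: top-level={top!r} | metadata={meta!r}")
```

Printing the pairs makes it easier to check which values describe the file and which describe the volume.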

...

There is more information in the metadata section, and you can expand and minimize nodes to explore the full range of bibliographic metadata captured by the EF file.

For example, in the "metadata" object you have the object "pubPlace". "pubPlace" indicates information related to where the bibliographic entity was first (or originally) published. If you expand "pubPlace", there are additional nodes nested underneath that hold attributes of "pubPlace": here a "name" of the pubPlace, "Israel"; information about the metadata field (here under "type"); and a URI stored in "id" that links to an authority file for the place, in this case the URI for "Israel" from the Library of Congress Linked Data Service.

We also see more nodes under "metadata" that may be of interest, including "language" and "accessRights". The "language" object will tell you what language the volume was published in using MARC language codes (“heb”, or Hebrew, in this case), while "accessRights" holds information on whether the volume is protected by copyright or in the public domain (here with the value "ic", which stands for “in copyright”).
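The metadata fields described above could be read programmatically along these lines. This is a sketch under assumptions: the dictionary mirrors only the fields named in the text, the "id" and "type" values are placeholders (not the real Library of Congress URI), and the lookup table covers only the one access-rights code mentioned here.

```python
# Hypothetical excerpt of the "metadata" object, limited to fields named in the text.
sample_metadata = {
    "pubPlace": {
        "id": "http://example.org/placeholder-place-uri",  # placeholder URI
        "type": "placeholder-type",                        # placeholder value
        "name": "Israel",
    },
    "language": "heb",
    "accessRights": "ic",
}

# Only the code explained in the text; other codes pass through unchanged.
ACCESS_RIGHTS_LABELS = {"ic": "in copyright"}


def summarize(metadata):
    """Collect place, language, and rights into one flat dictionary."""
    place = metadata.get("pubPlace") or {}
    rights = metadata.get("accessRights")
    return {
        "place_name": place.get("name"),
        "place_uri": place.get("id"),
        "language": metadata.get("language"),
        "rights": ACCESS_RIGHTS_LABELS.get(rights, rights),
    }


summary = summarize(sample_metadata)
print(summary)
```

Using `.get()` throughout keeps the sketch from failing on volumes where a field is absent.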

...

Features section

Minimize the "metadata" node and expand the "features" node. You’ll see something like this:

...

Once again, you will see some introductory terms like “id”, “type”, and “dateCreated”. You will also see "pageCount", which indicates the number of pages in the volume (here 298 pages). Note the "pages" node and expand it.




This is the section of the EF file that contains the volume’s page-level extracted features metadata. Each page is represented as an individual object in sequential order (i.e., in this case from page "0" through page "298"). Expand one of the page nodes.
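Walking the pages in order might look like the following sketch. It assumes "pages" is a JSON array, so each page's position is its index, and it uses a tiny made-up "features" object; the page-level "tokenCount" values are invented for illustration.

```python
# Hypothetical, tiny "features" object. Real volumes have far more pages.
sample_features = {
    "pageCount": 3,
    "pages": [
        {"tokenCount": 0},   # invented values for illustration
        {"tokenCount": 12},
        {"tokenCount": 7},
    ],
}


def iter_pages(features):
    """Yield (index, page) pairs in the order the pages are stored."""
    for index, page in enumerate(features.get("pages") or []):
        yield index, page


def page_count_matches(features):
    """Check whether "pageCount" agrees with the number of page objects."""
    return features.get("pageCount") == len(features.get("pages") or [])


for index, page in iter_pages(sample_features):
    print(index, page.get("tokenCount"))

print("pageCount matches:", page_count_matches(sample_features))
```

A check like `page_count_matches` is a cheap way to notice a truncated or partially downloaded file before doing any analysis.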

...

This header contains 2 tokens ("tokenCount") in 1 line ("lineCount"). There are even more nodes to expand here, but we will move on to "body".

Once the "body" node is expanded, note that information similar to that listed in “header” is also displayed here, such as “tokenCount”, “lineCount”, “beginCharCount”, “endCharCount”, etc.
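Because "header" and "body" share the same kinds of counts, they can be compared with one small helper. The sketch below assumes each section is an object nested in the page; the header values match the example in the text, while the body values are invented. Only "tokenCount" and "lineCount" are shown.

```python
# Hypothetical page object. Header counts follow the text; body counts are invented.
sample_page = {
    "header": {"tokenCount": 2, "lineCount": 1},
    "body": {"tokenCount": 250, "lineCount": 30},
}


def section_counts(page, sections=("header", "body")):
    """Return token and line counts per section, using 0 when a section is absent."""
    result = {}
    for name in sections:
        section = page.get(name) or {}  # tolerate a missing or null section
        result[name] = {
            "tokenCount": section.get("tokenCount", 0),
            "lineCount": section.get("lineCount", 0),
        }
    return result


counts = section_counts(sample_page)
for name, values in counts.items():
    print(name, values)
```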

...

If you were to move through each page of the document, you would find the tokens that appeared on that page, the part of speech of each token if it could be determined, and the number of times it appeared on the page.
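The token, part-of-speech, and count triples described here can be flattened into rows for further analysis. In this sketch the field name "tokenPosCount" and its nesting (token, then part-of-speech tag, then count) are assumptions not shown in the text above, and the tokens, tags, and counts are invented.

```python
# Hypothetical "body" section. The "tokenPosCount" name and shape are assumptions.
sample_body = {
    "tokenPosCount": {
        "the": {"DT": 3},
        "volume": {"NN": 1},
        "read": {"VB": 1, "NN": 1},
    }
}


def token_rows(section):
    """Flatten nested counts into (token, pos, count) rows, most frequent first."""
    counts = (section or {}).get("tokenPosCount") or {}
    rows = [
        (token, pos, n)
        for token, by_pos in counts.items()
        for pos, n in by_pos.items()
    ]
    # Sort by descending count, then alphabetically for a stable order.
    return sorted(rows, key=lambda row: (-row[2], row[0], row[1]))


rows = token_rows(sample_body)
for token, pos, n in rows:
    print(f"{token}\t{pos}\t{n}")
```

Rows in this shape are easy to load into a spreadsheet or data frame when aggregating across pages.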

...