Child pages
  • Fetching Volume OCR Content in HTRC Data Capsule (Secure Mode)

In the virtual machine you created in HTRC Data Capsule, while it is in maintenance mode, download the file DataAPI_SampleCode.zip and place it somewhere in the capsule's file system.

Then switch to secure mode to fetch content through the Data API. The Data API does not work in maintenance mode; this restriction is there to prevent data leaks.

Unzip the archive, then run this command to fetch some volumes:

python DownloadVolumes.py htrc-id output.zip
  • htrc-id is the list of IDs for the volumes you are interested in.
  • output.zip is the zip archive that will hold the fetched OCR content.
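Once the command finishes, it can help to check what actually landed in output.zip before working with it. The following is a minimal sketch using only the Python standard library; the helper name list_archive is hypothetical, and the internal layout of the archive is not described on this page, so the sketch simply lists whatever files it finds.

```python
import zipfile


def list_archive(path):
    """Return the sorted names of the files stored in a zip archive.

    Directory entries (names ending in "/") are skipped. Nothing is
    assumed about how the Data API arranges volumes inside the archive.
    """
    with zipfile.ZipFile(path) as archive:
        return sorted(
            name for name in archive.namelist() if not name.endswith("/")
        )
```

For example, calling list_archive("output.zip") from the directory where the command was run returns the file names, which can then be printed one per line.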

Note: to customize your htrc-id list, search the HTRC portal or use the Solr API to find the IDs of the volumes you are interested in.
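If you collect volume IDs from several searches, a small helper can merge them into one list. The sketch below assumes that the htrc-id list is a plain text file with one volume ID per line; this page does not state the format, so check it against the sample file in DataAPI_SampleCode.zip. The function name write_id_list is hypothetical.

```python
def write_id_list(volume_ids, path):
    """Write volume IDs to `path`, one per line (assumed format).

    Blank entries are dropped and duplicates are removed while the
    original order is kept. Returns the list of IDs actually written.
    """
    seen = set()
    unique_ids = []
    for volume_id in volume_ids:
        volume_id = volume_id.strip()
        if volume_id and volume_id not in seen:
            seen.add(volume_id)
            unique_ids.append(volume_id)
    with open(path, "w") as handle:
        for volume_id in unique_ids:
            handle.write(volume_id + "\n")
    return unique_ids
```

The resulting file can then be passed as the htrc-id argument of DownloadVolumes.py.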

Changing Parameters

In the DownloadVolumes.py script, go to the "Data API volume request parameters" section and change the parameters there. Use 'concat':'true' to concatenate the pages of a volume, and 'mets':'true' to return the METS file together with the volume content. Several examples follow.

For example, to concatenate the pages of each volume, uncomment this line:

VOLUME_PARAMETERS = {'concat':'true'}

To get the METS file, uncomment this line (note: a METS file is a metadata file that comes with each volume; it records the MARC and archive information of the book):

VOLUME_PARAMETERS = {'mets':'true'}

If you only need the volume content, without concatenating the pages, use this line:

VOLUME_PARAMETERS = {}

If you want the METS record returned along with the volume content, with all pages concatenated into a single text file per volume, use this line:

VOLUME_PARAMETERS = {'mets' : 'true', 'concat' : 'true'}
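The four variants above are the combinations of two on/off options. As a minimal sketch, a helper like the following (the name build_volume_parameters is hypothetical and is not part of DownloadVolumes.py) produces the same dictionaries from two booleans, which avoids commenting and uncommenting lines by hand.

```python
def build_volume_parameters(concat=False, mets=False):
    """Build the VOLUME_PARAMETERS dictionary from two on/off options.

    Only the 'concat' and 'mets' keys described on this page are used;
    an option that is off is left out of the dictionary entirely.
    """
    parameters = {}
    if mets:
        parameters['mets'] = 'true'
    if concat:
        parameters['concat'] = 'true'
    return parameters


# Equivalent to the last example above: METS record plus concatenated pages.
VOLUME_PARAMETERS = build_volume_parameters(concat=True, mets=True)
```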

Attachment

DataAPI_SampleCode.zip

 
