Optional method

Researchers can use the HTRC Data API to bring text data into their capsule, and can refer to the HTRC Data API guide for more details.

Preferred method

HTRC has also developed a Python library for loading volumes into the Data Capsule environment: the HTRC Workset Toolkit. The Toolkit is standard in all capsules created after March 18, 2018. If your capsule was created earlier, you will need to install or update the Toolkit.

Make sure you are in secure mode before fetching content into your Data Capsule; for security reasons, fetching won't work in maintenance mode.

You can use the Workset Toolkit's "htrc download" command to transfer the volumes you would like to include in your dataset.

For example, running the following command will transfer the OCR text data for the HathiTrust collection 'Adventure Novels: G.A. Henty'.

Code Block
htrc download 'https://babel.hathitrust.org/cgi/mb?a=listis;c=464226859'


You can also curate your own list of volumes to import by creating a file that contains the HathiTrust volume IDs you're interested in, with one ID per line. Run the above command, replacing the collection URL with your file name.

For example, if you had a file called myvolumes.txt, you would run the following command.

Code Block
htrc download myvolumes.txt
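
The volume ID file described above can be prepared by hand, but a short script may help when the IDs come from another source. The following is a minimal sketch, not part of the Workset Toolkit: the function name write_volume_list is hypothetical, and it assumes only what the text states, namely a plain text file with one ID per line.

```python
from pathlib import Path


def write_volume_list(volume_ids, path="myvolumes.txt"):
    """Write HathiTrust volume IDs to a text file, one ID per line.

    Hypothetical helper for illustration; not a Workset Toolkit function.
    """
    seen = set()
    cleaned = []
    for raw in volume_ids:
        vol_id = raw.strip()
        # Skip blank entries and repeated IDs, keeping first-seen order.
        if vol_id and vol_id not in seen:
            seen.add(vol_id)
            cleaned.append(vol_id)
    # One ID per line, each line terminated by a newline.
    Path(path).write_text("".join(f"{v}\n" for v in cleaned), encoding="utf-8")
    return cleaned
```

Calling write_volume_list with a list of IDs and the default path produces myvolumes.txt, which can then be passed to "htrc download" as in the command above.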


In the above examples, the data will be transferred to the default directory, "/media/secure_volume/workset/". If you want to specify an alternative location, provide an output by including -o and the file path in your command.
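
To illustrate how the -o option changes the command, here is a small sketch that assembles the "htrc download" arguments with and without an output location. The helper names are hypothetical, the path /media/secure_volume/my_output is only a placeholder, and placing -o before the source is an assumption about argument order rather than something this page states. The sketch only builds the command; it does not run it.

```python
import shlex


def build_download_command(source, output_dir=None):
    """Return the argument list for an `htrc download` call.

    source: a volume ID file name or a collection URL.
    output_dir: optional destination. When omitted, no -o is added,
    so the Toolkit's default directory applies.
    Hypothetical helper for illustration; not a Workset Toolkit function.
    """
    args = ["htrc", "download"]
    if output_dir:
        args += ["-o", output_dir]
    args.append(source)
    return args


def as_shell_line(args):
    # shlex.join quotes arguments containing characters such as ';' or '?',
    # which matters when the source is a collection URL.
    return shlex.join(args)


if __name__ == "__main__":
    # Placeholder destination path, for illustration only.
    print(as_shell_line(build_download_command(
        "myvolumes.txt", "/media/secure_volume/my_output")))
```

Building the command as a list and quoting it in one place keeps the collection URL intact when the line is pasted into a terminal.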

Other functions of the Workset Toolkit

You can also use a volume ID, collection URL, or catalog record ID to import volumes. Additionally, you have the option to concatenate files, remove folders, and retrieve metadata using the functions of the Workset Toolkit.

For more examples, see the detailed guide.

For the technical documentation, see: https://htrc.github.io/HTRC-WorksetToolkit/cli.html
