Researchers can use the HTRC Data API to bring text data into their capsule, and can refer to the HTRC Data API guide for more details.
HTRC has also developed a Python library, the HTRC Workset Toolkit, for loading volumes into the Data Capsule environment. The Toolkit comes standard in all capsules created after March 18, 2018; if your capsule was created earlier, you will need to install or update the Toolkit yourself.
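If you do need to install or upgrade the Toolkit, it is distributed on PyPI as the htrc package, so a pip install is one plausible route (note that outbound network access may only be available while the capsule is in maintenance mode, depending on your capsule's configuration):

```shell
# Install or upgrade the HTRC Workset Toolkit from PyPI.
# You may need to run this in maintenance mode, where the
# capsule has outbound network access.
pip install --user --upgrade htrc
```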
Before fetching content into your Data Capsule, make sure the capsule is in secure mode; for security reasons, downloading volumes will not work in maintenance mode.
You can use the Workset Toolkit's "htrc download" command to transfer the volumes of interest. For example, running the command below will transfer the OCR text data for the volumes in the generic htrc-id list that comes with the Workset Toolkit to a directory called "output."
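The referenced command looks something like the following (the path /home/dcuser/HTRC/htrc-id is an assumption about where the bundled ID list lives in your capsule; adjust it if your capsule stores the file elsewhere):

```shell
# Download OCR text for the volumes listed in the bundled htrc-id file
# into a directory called "output".
htrc download -o output /home/dcuser/HTRC/htrc-id
```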
To customize the volumes you transfer to your capsule, create a file listing the volume IDs you're interested in, one ID per line, then run the above command with your file in place of htrc-id. For example, if you had a file called myvolumes.txt, you would run the following command.
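A sketch of that command, assuming myvolumes.txt sits in your current working directory (use the full path otherwise):

```shell
# Download OCR text for the volumes listed in myvolumes.txt
# (one HathiTrust volume ID per line) into the "output" directory.
htrc download -o output myvolumes.txt
```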
- To build your volume ID list, you can search in HathiTrust or use other metadata sources, such as the HathiFiles.
In the above examples, output is the destination folder for the fetched OCR content; you can call it anything you like by replacing "output" with a name of your choice. If you omit the -o flag and its directory name entirely, the files will go to the default directory, /media/secure_volume/workset.
You can also use a volume ID, collection URL, or catalog record ID to import volumes. Additionally, you have the option to concatenate files and to remove folders.
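A few illustrative invocations (the volume ID and collection URL below are placeholders, not real identifiers; the -c flag for concatenating a volume's pages into a single file is described in the Toolkit's CLI documentation):

```shell
# Download a single volume by its HathiTrust volume ID (placeholder ID).
htrc download -o output mdp.39015000000000

# Download every volume in a public HathiTrust collection by its URL
# (placeholder URL).
htrc download -o output "https://babel.hathitrust.org/cgi/mb?a=listis;c=0000000000"

# Concatenate each volume's page files into one text file per volume.
htrc download -c -o output myvolumes.txt
```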
For more examples, see the detailed guide.
For the technical documentation, see: https://htrc.github.io/HTRC-WorksetToolkit/cli.html