HTRC Workset Toolkit is a command line interface for use in the HTRC Data Capsule environment. It streamlines access to the HTRC Data API and includes utilities to pull OCR text data and volume metadata into a capsule. It also lets a researcher pass OCR text data to several topic modeling tools that are available in the capsule.
Additional documentation is available here: https://htrc.github.io/HTRC-WorksetToolkit/cli.html
Getting the HTRC Workset Toolkit
- Capsules created after March 18, 2018 contain the Toolkit by default. If you created your capsule after March 18, 2018, please skip to Usage below.
- Capsules created prior to March 18, 2018 did not contain the HTRC Workset Toolkit command line interface by default, and user-installed versions of the Toolkit may now be out of date and fail to run. To ensure you are running the most up-to-date version of the Toolkit, please follow these steps to uninstall the current version and re-install the latest version:
Check whether you have the correct Python version installed in your capsule by typing the command indicated below on the command line in your capsule. The HTRC Workset Toolkit requires the Anaconda Python distribution, which is standard in all recently created capsules. While the Toolkit is compatible with both Python 2.7 and 3.6, we recommend using version 3.6 for future compatibility.
dcuser@dc-vm:~$ python --version
Python 3.6.0 :: Anaconda 4.3.1 (64-bit)
Some users may have self-installed the Toolkit in their capsules prior to March 18, 2018. If you have already installed the HTRC Workset Toolkit, uninstall it by using pip, as indicated below.
dcuser@dc-vm:~$ pip uninstall htrc
Install the latest version of the HTRC Workset Toolkit.
dcuser@dc-vm:~$ pip install htrc
Please note that updating may change the versions of the packages listed in HTRC-WorksetToolkit/setup.py that are installed in your capsule, so update with care if your existing workflows depend on them. Contact email@example.com for assistance upgrading your version of the Toolkit.
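The Python version check from the first step can also be scripted. The following is a minimal sketch, not part of the Toolkit: the helper name is_python3 is hypothetical, and it only inspects the version string to see whether the recommended Python 3 line is in use.

```shell
# Hypothetical helper (not part of the Toolkit): succeeds when a
# `python --version` string reports a Python 3 interpreter.
is_python3() {
    case "$1" in
        "Python 3."*) return 0 ;;
        *) return 1 ;;
    esac
}

# Python 2 prints its version to stderr, so capture both streams.
version="$(python --version 2>&1 || true)"

if is_python3 "$version"; then
    echo "OK: $version"
else
    echo "Python 3 not detected: $version"
fi
```

If the check reports that Python 3 was not detected, the Toolkit may still run under Python 2.7, but version 3.6 is the recommended one.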
Usage
The HTRC Workset Toolkit has four primary functions, which allow users to download metadata and OCR data, run analysis tools, and export lists of volume IDs from the capsule:
- Volume Download
- Metadata Download
- Pre-built Analysis Workflows
- Export of volume lists
The commands also expect a so-called workset path, which is how you point to the volume(s) with which you would like to work. The Toolkit accepts several different forms of identifiers for the workset path, as described in the following table. You can choose to use whichever is the most conducive to your research workflow.
|HathiTrust Catalog ID||001423370|
|Handle.org Volume URL||https://hdl.handle.net/2027/mdp.39015078560078|
|HathiTrust Catalog URL||https://catalog.hathitrust.org/Record/001423370|
|HathiTrust Collection Builder URL (for public collections only)||https://babel.hathitrust.org/shcgi/mb?a=listis;c=696632727|
|Local volumes file||/home/dcuser/Downloads/collections.txt|
Importing data (Volume Download)
To import OCR data, run the command htrc download, followed by any optional arguments that control how the data is transferred, and then the workset path. This command will only work when your capsule is in secure mode.
The format for the command looks like this:
htrc download [-h] [-f] [-o OUTPUT] [-c] [file]
The square brackets mark optional parts of the command and placeholders that you should replace before you run it. The named arguments, which are the "flagged" letters in the command, are described in the following table:
|Named argument||What it does||What happens if it's not included||Other info|
|-f||Remove folder if exists||Folders will not be removed|| |
|-o OUTPUT||Indicates that you will be choosing a directory location where the files should go||Data will be sent to “/media/secure_volume/workset/”||Should be followed by the directory path where you would like the data to go|
|-c||Concatenate a volume’s pages into a single file||Page files will not be concatenated|| |
The command ends with the workset path (called file in the format shown above). You choose the identifier for the workset path as described above, for example the HathiTrust ID, a local file of volume IDs, or a public HathiTrust collection URL.
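To see how the named arguments and the workset path fit together, the following sketch assembles the command as text without running it. The function build_download_cmd and the variables REMOVE, CONCAT, and OUTPUT are hypothetical names chosen for illustration and are not part of the Toolkit; only the flags -f, -c, and -o come from the format shown above. The sketch assumes paths without spaces.

```shell
# Hypothetical helper (not part of the Toolkit): prints the htrc download
# command that corresponds to the chosen options.
#   REMOVE=1      adds -f (remove folder if it exists)
#   CONCAT=1      adds -c (concatenate each volume's pages)
#   OUTPUT=<dir>  adds -o <dir> (custom output directory)
build_download_cmd() {
    cmd="htrc download"
    if [ "${REMOVE:-0}" = "1" ]; then cmd="$cmd -f"; fi
    if [ "${CONCAT:-0}" = "1" ]; then cmd="$cmd -c"; fi
    if [ -n "${OUTPUT:-}" ]; then cmd="$cmd -o $OUTPUT"; fi
    # The workset path always comes last.
    printf '%s %s\n' "$cmd" "$1"
}

# Run in a subshell so the option variables do not leak into the session.
(
    CONCAT=1
    OUTPUT=/media/secure_volume/my-workset
    build_download_cmd 009132117
)
# prints: htrc download -c -o /media/secure_volume/my-workset 009132117
```

Printing the command first is a cheap way to confirm the flag combination before running the real download in secure mode.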
Import the data for the volumes in a HathiTrust collection, using the URL you can find when viewing the collection, without concatenating files or removing folders. The files will be directed to the standard location (/media/secure_volume/workset/).
htrc download "https://babel.hathitrust.org/cgi/mb?a=listis&c=1337751722"
Import the data for one volume, indicated by its HathiTrust volume ID, to a specified directory location called my-workset in this example. The files will not be concatenated and the folders will not be removed.
htrc download -o /media/secure_volume/my-workset coo.31924089593846
Import the data for volume IDs that have been saved to a text file in your capsule. The list should NOT include a header row, and should include the IDs only in a single column. You can call the file whatever you like; we have called it mylist.txt for the example. In this example, the folders will be removed, but the files will not be concatenated, and the files will be directed to the standard location (/media/secure_volume/workset/).
htrc download -f mylist.txt
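One way to prepare such a file is sketched below. The two volume IDs are taken from examples elsewhere on this page, and the sanity check is an assumption about what a malformed list tends to look like (blank lines, spaces, or commas), not a rule enforced by the Toolkit. It cannot detect a header row, so check for that by eye.

```shell
# File name is arbitrary; mylist.txt matches the example above.
LIST_FILE="mylist.txt"

# One volume ID per line, no header row.
printf '%s\n' \
    "coo.31924089593846" \
    "mdp.39015078560078" > "$LIST_FILE"

# Flag blank lines, and whitespace or commas that suggest extra columns.
if grep -q -e '^$' -e '[[:space:],]' "$LIST_FILE"; then
    echo "Check $LIST_FILE: expected one ID per line in a single column" >&2
else
    echo "$(wc -l < "$LIST_FILE" | tr -d ' ') volume IDs ready in $LIST_FILE"
fi
```

Once the file passes the check, it can be used as the workset path in the htrc download command above.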
Import data for the volumes that share a HathiTrust record ID, which generally indicates that they are either multiple items that were digitized representing the same work, or that they are serial publications for a periodical. The record ID can be found in the URL when viewing the catalog record for an item. In this example, the files will be concatenated, the folders retained, and the files directed to a specific directory, which we have called my-workset for illustrative purposes.
htrc download -c -o /media/secure_volume/my-workset 009132117