
HTRC Workset Toolkit is a command line interface for use in the HTRC Data Capsule environment. It streamlines access to the HTRC Data API and includes utilities to pull OCR text data and volume metadata into a capsule. Additionally, it allows a researcher to point OCR text data to analysis tools that are also available in the capsule. 

Additional documentation is also available here:

Getting the HTRC Workset Toolkit

  • Capsules created after March 18, 2018 contain the Toolkit by default. If you created your capsule after March 18, 2018, please skip to Usage below. 

  • Capsules created prior to March 18, 2018 did not contain the HTRC Workset Toolkit command line interface by default, and user-installed versions of the Toolkit may now be out of date and fail to run. To ensure you are running the most up-to-date version of the Toolkit, please follow these steps to uninstall the current version and re-install the latest version:
    • Check whether you have the correct Python version installed in your capsule by typing the command indicated below on the command line in your capsule. The HTRC Workset Toolkit requires the Anaconda Python distribution, which is likewise standard in all recently-created capsules. And while the Toolkit is compatible with both Python 2.7 and 3.6, we recommend using the 3.6 version for future compatibility. 

      Check python version
      dcuser@dc-vm:~$ python --version
      You should see this
      Python 3.6.0 :: Anaconda 4.3.1 (64-bit)
    • Some users may have self-installed the Toolkit in their capsules prior to March 18, 2018. If you have already installed the HTRC Workset Toolkit, uninstall it by using pip, as indicated below. 

      Uninstall the toolkit
      dcuser@dc-vm:~$ pip uninstall htrc
    • Install the latest version of the HTRC Workset Toolkit.

      Install the toolkit
      dcuser@dc-vm:~$ pip install htrc

Please note that updating may affect the versions of packages listed at HTRC-WorksetToolkit/ that you are running in your capsule, and therefore should be done with care for your existing workflows. Contact for assistance upgrading your version of the Toolkit.
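The Python version check above, and the question of whether a copy of the Toolkit is already installed, can also be answered from inside Python. The following is a minimal sketch and not part of the Toolkit: the function name check_environment is hypothetical, and it assumes that the pip package htrc installs an importable module with the same name.

```python
import importlib.util
import sys


def check_environment(min_version=(3, 6)):
    """Report the running Python version and whether `htrc` can be imported.

    Hypothetical helper for illustration only. It assumes the Toolkit's
    importable module is named `htrc`, like the pip package.
    """
    version = sys.version_info[:3]
    return {
        "python_version": ".".join(str(part) for part in version),
        # Tuple comparison: (3, 6, 0) >= (3, 6) is True.
        "python_ok": version >= min_version,
        # find_spec returns None when no top-level module of that name exists.
        "toolkit_installed": importlib.util.find_spec("htrc") is not None,
    }


if __name__ == "__main__":
    for key, value in check_environment().items():
        print("{}: {}".format(key, value))
```

If toolkit_installed is reported as False, the Toolkit is probably missing from the active Python environment, and the pip install step above applies.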


The HTRC Workset Toolkit has four primary functions which allow users to download metadata and OCR data, run analysis tools, and export lists of volume IDs from the capsule. 

  • Volume Download
    • htrc download
  • Metadata Download
    • htrc metadata
  • Pre-built Analysis Workflows
    • htrc run
  • Export of volume lists
    • htrc export

The commands also expect a so-called workset path, which is how you point to the volume(s) you would like to analyze. The Toolkit accepts several different forms of identifiers for the workset path, as described in the following table. You can use whichever is most conducive to your research workflow. You don't need to specify which kind of workset path you are using; simply include the identifier (e.g. the HathiTrust ID or the HathiTrust Catalog URL) in your command. 

Identifier Type | Example | Note
Local volumes file | /home/dcuser/Downloads/collections.txt |
HathiTrust ID | mdp.39015078560078 |
HathiTrust Catalog ID | 001423370 |
HathiTrust URL | ;view=1up;seq=13 | must be in quotes in command
Volume URL | | must be in quotes in command
HathiTrust Catalog URL | | must be in quotes in command
HathiTrust Collection Builder URL (for public collections only) | ;c=696632727 | must be in quotes in command
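To make the identifier types in the table above more concrete, the sketch below guesses the type of a workset path from its shape and reports whether it should be quoted on the command line. It is an illustration only: the Toolkit performs its own resolution, and the function names and matching rules used here (for example, treating anything that ends in .txt as a local file) are assumptions, not the Toolkit's logic.

```python
import re


def classify_workset_path(path):
    """Guess the identifier type of a workset path (heuristic sketch).

    The rules below are assumptions for illustration, not the Toolkit's own.
    """
    if re.match(r"^https?://", path):
        return "url"
    if path.isdigit():
        # Catalog record IDs in the table are digits only, e.g. 001423370.
        return "catalog_id"
    if path.startswith(("/", "./", "~")) or path.endswith(".txt"):
        return "local_file"
    if re.match(r"^[a-z0-9]+\.\S+$", path):
        # Volume IDs in the table look like <prefix>.<identifier>.
        return "hathitrust_id"
    return "unknown"


def needs_quotes(path):
    """URLs should be quoted; they can contain ';', which the shell
    would otherwise treat as a command separator."""
    return classify_workset_path(path) == "url"


if __name__ == "__main__":
    for example in ("/home/dcuser/Downloads/collections.txt",
                    "mdp.39015078560078",
                    "001423370"):
        print(example, "->", classify_workset_path(example))
```

The quoting requirement for URLs follows from how the shell parses semicolons and ampersands: without quotes, only the part of the URL before the first such character would reach the htrc command.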

Importing data (Volume Download)

To import OCR data, run the command below. It takes several optional arguments that affect how the data is transferred, followed by the workset path. This command will only work when your capsule is in secure mode. 

The format for the command looks like this:

htrc download [-h] [-f] [-o OUTPUT] [-c] [FILE]

The brackets indicate optional text and/or text that should be changed before you run the command. The named arguments, which are the "flagged" letters in the command, are described in the following table: 

Named argument | What it does | What happens if it's not included
-f, --force | Remove folders if they exist | Folders will not be removed
-o, --output | Indicates that you will be choosing a directory location where the files should go. Should be followed by the in-capsule directory path of your choice. | Data will be sent to “/media/secure_volume/workset/”
-c, --concat | Concatenate a volume’s pages into a single file | Page files will not be concatenated
-h, --help | Displays the help manual for the toolkit | Help manual is not displayed

The command ends with the workset path (shown as FILE in the format above). Choose the identifier for the workset path as described above: for example, a HathiTrust ID, a local file of volume IDs, or a public HathiTrust collection URL. 
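For scripted workflows, the command format above can be assembled programmatically. The sketch below is one possible way to do this, using only the flags documented in the table (-f, -c, -o); the function names build_download_command and as_shell_string are hypothetical and are not part of the Toolkit.

```python
import shlex


def build_download_command(workset_path, output=None, force=False, concat=False):
    """Assemble an `htrc download` argument list (hypothetical helper).

    Only the flags described in the table above are used.
    """
    args = ["htrc", "download"]
    if force:
        args.append("-f")      # remove folders if they exist
    if concat:
        args.append("-c")      # concatenate each volume's pages into one file
    if output is not None:
        args.extend(["-o", output])
    args.append(workset_path)  # the workset path always comes last
    return args


def as_shell_string(args):
    """Render the argument list as a shell command, quoting where needed."""
    return " ".join(shlex.quote(arg) for arg in args)


if __name__ == "__main__":
    command = build_download_command(
        "009132117", output="/media/secure_volume/my-workset", concat=True
    )
    print(as_shell_string(command))
```

Inside a capsule, the returned list could be passed to subprocess.run(args, check=True). Passing a list rather than a single string avoids shell quoting problems with URLs altogether.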


The following command will import the data for the volumes in a HathiTrust collection using the URL you can find when viewing the collection. It will not concatenate files or remove folders. The files will be directed to the standard location (/media/secure_volume/workset/).

htrc download “”

The following command will import the data for one volume, indicated by its HathiTrust volume ID, to a specified directory location, called my-workset in this example. The files will not be concatenated and the folders will not be removed. 

htrc download -o /media/secure_volume/my-workset coo.31924089593846

The following command will import the data for volume IDs that have been saved to a text file in your capsule. The list should NOT include a header row and should contain only the IDs, in a single column. You can call the file whatever you like; we have called it mylist.txt for the example. In this example, existing folders will be removed, but the files will not be concatenated, and the files will be directed to the standard location (/media/secure_volume/workset/).

htrc download -f mylist.txt
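A volume list in the shape described above (one ID per line, no header row) can be produced with a few lines of Python. This is a minimal sketch under stated assumptions: write_volume_list is a hypothetical helper, not a Toolkit function, and its only validation is rejecting entries that contain whitespace or commas, which would suggest more than one column.

```python
def write_volume_list(volume_ids, path):
    """Write volume IDs to `path`, one per line, with no header row.

    Hypothetical helper for illustration. Returns the number of IDs written.
    """
    cleaned = []
    for volume_id in volume_ids:
        volume_id = volume_id.strip()
        if not volume_id:
            continue  # skip blank entries
        if any(ch.isspace() or ch == "," for ch in volume_id):
            raise ValueError("not a single ID: {!r}".format(volume_id))
        cleaned.append(volume_id)
    with open(path, "w", encoding="utf-8") as handle:
        for volume_id in cleaned:
            handle.write(volume_id + "\n")
    return len(cleaned)


if __name__ == "__main__":
    count = write_volume_list(
        ["mdp.39015078560078", "coo.31924089593846"], "mylist.txt"
    )
    print("wrote", count, "volume IDs to mylist.txt")
```

The resulting file can then be used as the workset path, as in the htrc download -f mylist.txt example.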

The following command will import data for the volumes that share a HathiTrust record ID, which generally indicates that they are either multiple items that were digitized representing the same work, or that they are serial publications for a periodical. The record ID can be found in the URL when viewing the catalog record for an item. In this example, the files will be concatenated, the folders retained, and the files directed to a specific directory, which we have called my-workset for illustrative purposes. 

htrc download -c -o /media/secure_volume/my-workset 009132117
