HTRC Workset Toolkit is a command line interface for use in the HTRC Data Capsule environment. It streamlines access to the HTRC Data API and includes utilities to pull OCR text data and volume metadata into a capsule. Additionally, it allows a researcher to point OCR text data to analysis tools that are also available in the capsule.
Additional documentation is also available here: https://htrc.github.io/HTRC-WorksetToolkit/cli.html
Getting the HTRC Workset Toolkit
- Capsules created after March 18, 2018 contain the Toolkit by default. If you created your capsule after March 18, 2018, please skip to Usage below.
- Capsules created prior to March 18, 2018 did not contain the HTRC Workset Toolkit command line interface by default, and user-installed versions of the Toolkit may now be out of date and fail to run. To ensure you are running the most up-to-date version of the Toolkit, please follow these steps to uninstall the current version and re-install the latest version:
Check whether you have the correct Python version installed in your capsule by typing the command indicated below on the command line in your capsule. The HTRC Workset Toolkit requires the Anaconda Python distribution, which is likewise standard in all recently-created capsules. And while the Toolkit is compatible with both Python 2.7 and 3.6, we recommend using the 3.6 version for future compatibility.
Check Python version:
dcuser@dc-vm:~$ python --version
You should see this:
Python 3.6.0 :: Anaconda 4.3.1 (64-bit)
Some users may have self-installed the Toolkit in their capsules prior to March 18, 2018. If you have already installed the HTRC Workset Toolkit, uninstall it by using pip, as indicated below.
Uninstall the toolkit:
dcuser@dc-vm:~$ pip uninstall htrc
Install the latest version of the HTRC Workset Toolkit.
Install the toolkit:
dcuser@dc-vm:~$ pip install htrc
Please note that updating may affect the versions of packages listed at HTRC-WorksetToolkit/setup.py that you are running in your capsule, and therefore should be done with care for your existing workflows. Contact htrc-help@hathitrust.org for assistance upgrading your version of the Toolkit.
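The Python version check from the first step can also be scripted. The following is a minimal sketch, not part of the Toolkit: the function name `is_python3` is hypothetical, and the banner string is copied from the expected output shown earlier rather than read from a live interpreter.

```shell
# Hypothetical helper: succeeds when a "Python X.Y.Z ..." banner
# reports major version 3.
is_python3() {
    case "$1" in
        "Python 3."*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example banner copied from the expected output of `python --version`.
# In a capsule you could capture it live with:
#   banner="$(python --version 2>&1)"
# (2>&1 is needed because Python 2 prints its version to stderr.)
banner="Python 3.6.0 :: Anaconda 4.3.1 (64-bit)"

if is_python3 "$banner"; then
    echo "Python 3 detected"
else
    echo "Python 3 not detected; 3.6 is the recommended version"
fi
```

With the example banner, the script takes the first branch and reports that Python 3 was detected.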
Usage
The HTRC Workset Toolkit has four primary functions which allow users to download metadata and OCR data, run analysis tools, and export lists of volume IDs from the capsule.
- Volume Download
htrc download
- Metadata Download
htrc metadata
- Pre-built Analysis Workflows
htrc run
- Export of volume lists
htrc export
The commands also expect a workset path, which is how you point to the volume(s) you would like to analyze. The Toolkit accepts several forms of identifier for the workset path, as described in the following table. Use whichever best suits your research workflow. You don't need to specify which kind of workset path you are using; simply include the identifier (e.g., the HathiTrust ID or the HathiTrust Catalog URL) in your command.
Identifier Type | Example | Note |
---|---|---|
Local volumes file | /home/dcuser/Downloads/collections.txt | |
HathiTrust ID | mdp.39015078560078 | |
HathiTrust Catalog ID | 001423370 | |
HathiTrust URL | https://babel.hathitrust.org/cgi/pt?id=mdp.39015078560078;view=1up;seq=13 | must be in quotes in command |
Handle.org Volume URL | https://hdl.handle.net/2027/mdp.39015078560078 | must be in quotes in command |
HathiTrust Catalog URL | https://catalog.hathitrust.org/Record/001423370 | must be in quotes in command |
HathiTrust Collection Builder URL (for public collections only) | https://babel.hathitrust.org/shcgi/mb?a=listis;c=696632727 | must be in quotes in command |
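The quoting requirement in the table comes from how the shell parses a command line: an unquoted `;` ends the command, so a URL containing one would be cut off before it reaches the Toolkit. The sketch below illustrates this with the example HathiTrust URL from the table. The helper `count_args` and the variable name `workset_url` are made up for this illustration and are not part of the Toolkit.

```shell
# Helper for this sketch only: reports how many arguments it received.
count_args() {
    echo "$#"
}

# Single quotes stop the shell from treating ';' as a command separator.
workset_url='https://babel.hathitrust.org/cgi/pt?id=mdp.39015078560078;view=1up;seq=13'

# Quoting the variable passes the whole URL as exactly one argument.
arg_count=$(count_args "$workset_url")
echo "arguments received: $arg_count"

# The quoted value can then serve as the workset path, for example:
#   htrc metadata "$workset_url"
```

Either single or double quotes work for the URLs shown in the table, since none of them contain characters that double quotes would still expand.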
Importing data (Volume Download)
To import OCR data, run the command below with your choice of optional arguments, which control how the data is transferred, followed by the workset path. This command works only when your capsule is in secure mode.
The format for the command looks like this:
htrc download [-h] [-f] [-o OUTPUT] [-c] [FILE]
The brackets indicate optional text and/or text that should be changed before you run the command. The named arguments, which are the "flagged" letters in the command, are described in the following table:
Named argument | What it does | What happens if it's not included |
---|---|---|
-f, --force | Remove folders if they exist | Folders will not be removed |
-o, --output | Indicates that you will be choosing a directory location where the files should go. Should be followed by the in-capsule directory path of your choice. | Data will be sent to “/media/secure_volume/workset/” |
-c, --concat | Concatenate a volume’s pages into a single file | Page files will not be concatenated |
-h, --help | Displays the help manual for the toolkit | Help manual is not displayed |
The command ends with the workset path (shown as FILE in the format above). You choose the identifier for the workset path as described above, for example the HathiTrust ID, a local file of volume IDs, or a public HathiTrust collection URL.
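As an illustration of how the named arguments and the workset path fit together, here is a minimal sketch that assembles one possible download command. The output directory `/media/secure_volume/my_workset` and the `RUN_HTRC` opt-in switch are assumptions made for this example, not Toolkit features. The volume ID is the example HathiTrust ID from the identifier table. The script prints the command and runs it only when `RUN_HTRC=1` is set and `htrc` is on the PATH.

```shell
# Hypothetical output directory inside the capsule; change it to suit your setup.
output_dir="/media/secure_volume/my_workset"
# Example HathiTrust ID from the identifier table.
workset_path="mdp.39015078560078"

# Assemble the command: -f removes existing folders, -o sets the destination,
# -c concatenates each volume's pages into a single file.
set -- htrc download -f -o "$output_dir" -c "$workset_path"

# Print the command so it can be reviewed before anything is downloaded.
download_cmd="$*"
echo "$download_cmd"

# Run it only on request (RUN_HTRC is a switch invented for this sketch)
# and only if the toolkit is actually installed.
if [ "${RUN_HTRC:-0}" = "1" ] && command -v htrc >/dev/null 2>&1; then
    "$@"
fi
```

Leaving out `-o` would send the data to the default location listed in the table, and leaving out `-c` would keep one file per page.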