HTRC worksets are user-created collections of HathiTrust volumes to be treated as data and analyzed using HTRC tools and services. Worksets are curated by researchers, and they can be shared and cited to improve reproducibility.
One of the easiest way to generate a list of volume IDs for analysis is to first create a collection in the HathiTrust Digital Library. You can then bring your collection to HTRC as a workset using a one-click import
Here are the basic steps:
Note: HTRC Analytics may display the order of your volumes differently than they appeared on the input file.
Other options for creating a workset include:
The HTRC Workset Builder 2.0 Beta is a beta tool and interface built over the HTRC Extracted Features Dataset to enable both volume-level metadata search and volume- and page-level unigram (single word) text search in order to build worksets for use with HTRC tools and services. Read documentation for Workset Builder, or follow a tutorial for importing a workset from Workset Builder into HTRC Analytics.
HTRC Algorithms can analyze volumes in a workset so long as they have been synched with HTRC from HathiTrust. While syncing happens regularly, there may be occasional discrepancies.
Worksets can contain in-copyright ("limited view") as well as public domain ("full view") volumes from HathiTrust. HathiTrust data is not exposed or viewable within HTRC Algorithms or worksets. A researcher applies an algorithm to their workset (collection) and the data is called and crunched behind the scenes.
HTRC can help you create a list of volume IDs if the options described above do not meet your needs. Contact email@example.com for assistance.
Worksets start as lists of HathiTrust volume identification numbers (for example, hvd.hn5f64). If uploading a volume list file to create a workset, the file should be in CSV (comma-separated-value) or TXT format, and while it may contain other columns, it is only required to have your volume IDs in the first column. The file should contain a header row containing the text "volume" or "id".
HTRC Workset Toolkit is a command line interface for use in the HTRC Data Capsule environment. It streamlines access to the HTRC Data API and includes utilities to pull text data and volume metadata into a capsule. Additionally, it allows a researcher to point OCR text data to analysis tools that are also available in the capsule.
It can also be used to manage volume IDs from HathiTrust collections, HathiTrust bibliographic record numbers, and more.
Additional documentation is also available here: https://htrc.github.io/HTRC-WorksetToolkit/cli.html