Learn the three different ways you can create worksets in HTRC Analytics, as well as how to validate and download a workset to your personal machine.
Which Method Is Right for Your Use Case?
|Workset Creation Method||Best Used For...|
|Import a collection from HathiTrust||Building a small workset containing specific previously-known volumes|
|Build a workset with Workset Builder 2.0||Building a large workset containing previously-unknown volumes that match specific text or metadata criteria|
|Upload a list of volume IDs||Combining multiple worksets or HDL collections, or making small additions or deletions to existing worksets or collections|
Import a Collection from HathiTrust
This is an ideal method for building a relatively small workset of specific volumes.
Use this tutorial to import an existing collection from the HathiTrust Digital Library or build and import your own.
- Log into hathitrust.org using your institutional credentials.
- Enter your search terms. You will likely want to uncheck the box marked “Full View Only.” While it is checked, your search will only return public domain works that can be viewed in full. By unchecking the box, your search will include in-copyright material that cannot be read directly but can be analyzed via HTRC tools. (See HDL’s Search Tips for help with searching.)
- On the search results page, check the box next to the volume name, then click the Add button next to the drop-down menu that reads “New collection…”.
- When adding volumes to a new collection, you will be prompted to name the collection and give it a description. Note: In order to import your collection into HTRC Analytics, you must make your collection public.
- Add additional volumes to your collection. You can add multiple volumes at the same time by checking the boxes next to multiple books. Make sure you have the right collection selected before clicking Add.
- Once you have finished adding volumes, select Collections from the menu at the top of your screen to see a list of all your collections. Click on the one you are currently working with. On the collection’s page, copy the URL in the box underneath the heading “Link to this collection.”
- In a separate tab, go to https://analytics.hathitrust.org/ and sign in using your institutional credentials, then select Worksets from the menu at the top of the page.
- Click Create a Workset, then Import from HathiTrust.
- Paste the collection URL that you copied from HDL, then click Fetch Collection. The page will automatically give your new workset the name and description of your HDL collection, but you can change them if you would like. Select whether you would like your workset to be public or private, then click Create Workset.
You will return to a list of all of your worksets. From there, you can select your workset to view information about the volumes, change its public/private status, and run HTRC algorithms.
Build a Workset Using Workset Builder 2.0
This is an ideal method for building a relatively large workset containing volumes that match specific criteria.
This tutorial covers a fairly simple use case using metadata searching. For a more detailed explanation, see the HTRC Workset Builder 2.0 (Beta) for Extracted Features 2.0 page.
- First, navigate to the Workset Builder, available at https://solr2.htrc.illinois.edu/.
- Perform a unigram (single-term) text, metadata or combined text and metadata search in the Workset Builder to generate a list of results. Note that as the Workset Builder is searching millions of volumes and trillions of individual pages, searches sometimes take a moment to load.
When performing a metadata search, note that certain fields require standardized values, typically in MARC formats. (For a more detailed guide to metadata fields and how to search them, please see the dedicated HTRC Workset Builder 2.0 page.) For example, if performing a search for books in Spanish, select Language from the drop-down menu next to the search box, then enter "spa" (the MARC code for Spanish). Searching for "Spanish" would yield no results.
You can also search multiple metadata fields at once by chaining together multiple searches with the AND operator. With All Fields selected from the drop-down menu, try searching "language_t:spa AND pubDate_i:1950" to get a list of all volumes in Spanish published in 1950.
- From the page of results, choose volumes to keep in your shopping cart by checking boxes next to each volume and pressing the yellow "Add" button or by selecting volumes via check box and dragging and dropping them into the shopping cart icon at the top right on the result page.
- Once you have selected the desired volumes for your workset, click on the shopping cart icon on the results page to view your selection. This page will show you what is in your current selection, as well as present new options for interacting with, saving, and exporting your workset.
- From the shopping cart page, you can export your workset as a list of volume IDs, a federated metadata file in JSON, TSV or CSV format, or download the JSON Extracted Features data for the volumes in your selection. You can also choose the "Export as Workset" button to directly export your shopping cart to HTRC Analytics as a workset.
- Once you click to export the workset, you'll be directed to HTRC Analytics and prompted to log in if you are not already logged in. Once logged in, you'll be taken directly to the import page, where you're asked to add a name and description for your workset and to decide whether to make it a public (shareable via URL and usable by others) or private (viewable only by your user account) workset.
The current version of the Workset Builder can be accessed at https://solr2.htrc.illinois.edu/.
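The chained metadata query shown earlier can also be assembled programmatically before pasting it into the search box. The following is a minimal sketch, not part of the Workset Builder itself: `build_query` is a hypothetical helper, and the only field names used are the two that appear in the example query (`language_t` and `pubDate_i`). It does not quote or escape values, so it is only suitable for simple single-term values like these.

```python
def build_query(criteria):
    """Join field:value pairs with the AND operator.

    `criteria` maps a metadata field name to the value to search for.
    This is an illustrative helper, not an HTRC API.
    """
    parts = [f"{field}:{value}" for field, value in criteria.items()]
    return " AND ".join(parts)


# Reproduce the example query from the tutorial text.
query = build_query({"language_t": "spa", "pubDate_i": 1950})
print(query)  # language_t:spa AND pubDate_i:1950
```

The resulting string can be pasted into the search box with All Fields selected.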
Build a Workset from a List of Volume IDs
You can create a workset to use with HTRC algorithms by uploading a list of HathiTrust volume IDs.
There are several ways to get a list of volume IDs, including downloading the metadata for one or more HathiTrust collections and curating a list locally, using the HathiTrust Bibliographic API, or using the HathiFiles.
This is an ideal method for constructing a workset partially based on another researcher's existing workset. Currently, it is not possible to add volumes to or remove volumes from worksets once they have been created on HTRC Analytics, but it is fairly simple to download those volumes' IDs and use them to construct a new workset.
Once you have a list of volume IDs, make sure it conforms to the file requirements. Your volume ID list must be in CSV, TSV, or TXT format, and the only required content is the volume IDs in the left-most column. Additional fields will be ignored, so while they can be present, they won't affect the upload or the metadata for your workset. The file should contain a header row containing the text "volume" or "id".
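If your volume IDs are already in a Python list, a file that meets these requirements can be produced with the standard `csv` module. This is a minimal sketch under stated assumptions: `format_volume_list` is a hypothetical helper, the IDs shown are made-up placeholders rather than real volumes, and "volume" is used as the header because it is one of the two header texts named above.

```python
import csv
import io


def format_volume_list(volume_ids):
    """Return CSV text with a 'volume' header and one ID per row.

    The IDs land in the left-most (and only) column.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["volume"])  # header row containing the text "volume"
    for volume_id in volume_ids:
        writer.writerow([volume_id.strip()])
    return buffer.getvalue()


# Placeholder IDs for illustration only; substitute your own list.
example_ids = ["mdp.0000000001", "uc1.0000000002"]
csv_text = format_volume_list(example_ids)
print(csv_text)
```

To save the result for upload, write `csv_text` to a file with a `.csv` extension using UTF-8 encoding.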
This tutorial will cover the specific circumstance of supplementing another researcher's list of volumes with volumes of your own.
- Pick a collection that interests you from the HathiTrust Digital Library Featured Collections. When viewing the collection page, click Download Metadata, making sure that Tab-Delimited Text is the download format.
- Create your own small collection of volumes and download its metadata as a TSV as well.
- Open both TSVs using Excel or a similar application. Your files will look like this:
- Copy all rows except the header from your smaller collection and paste them into the TSV for the larger collection, then save the combined list. (Your computer may default to saving the file as a TXT file. This will not be an issue for uploading the list.)
- Log into HTRC Analytics and go to the Worksets page. From there, select Create a Workset, then choose Upload File.
- On the workset creation page, give your workset a title and description, then upload your combined file. Select whether you would like your workset to be public or private, then click Create Workset.
In addition to working with collections from the HathiTrust Digital Library, you can use these same instructions for making changes to worksets on HTRC Analytics. Go to the page for the workset you would like to start with and click the Download button to download a CSV containing the workset's volume IDs and other metadata.
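The copy-and-paste steps above can also be done in a short script instead of a spreadsheet. The sketch below assumes only that each TSV has a header row and the volume ID in the left-most column; the function names, the sample column names ("id", "title"), and the sample IDs are placeholders and may not match what a real HathiTrust metadata download contains. Dropping duplicate IDs is a design choice added here, not something the upload requires.

```python
import csv
import io


def left_column_ids(tsv_text):
    """Return the left-most column of a TSV, skipping the header row."""
    rows = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    next(rows, None)  # skip the header row
    return [row[0].strip() for row in rows if row and row[0].strip()]


def combine_id_lists(*id_lists):
    """Concatenate ID lists, keeping first occurrences and original order."""
    seen = set()
    combined = []
    for ids in id_lists:
        for volume_id in ids:
            if volume_id not in seen:
                seen.add(volume_id)
                combined.append(volume_id)
    return combined


# Placeholder data standing in for the two downloaded TSV files.
larger = "id\ttitle\nmdp.0001\tFirst placeholder\nmdp.0002\tSecond placeholder\n"
smaller = "id\ttitle\nmdp.0002\tSecond placeholder\nuc1.0003\tThird placeholder\n"

combined = combine_id_lists(left_column_ids(larger), left_column_ids(smaller))
print(combined)  # ['mdp.0001', 'mdp.0002', 'uc1.0003']
```

With real files, read each one into a string first (for example with `open(path, encoding="utf-8").read()`), then write the combined IDs out under a header row before uploading.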
Validate a Workset or List of Volume IDs
HathiTrust is a dynamic repository: It continues to grow, and, with less frequency, items are removed or their access profile changes. In order to check if the volumes in your workset are available for analysis using HTRC algorithms or the HTRC Data Capsule environment, you can validate a workset.
Note: HTRC Algorithms and HTRC Data Capsules can currently access a snapshot of public domain volumes from the HathiTrust Digital Library. The HTRC is making improvements to increase the frequency with which data is synced from HathiTrust. The most recent HTRC Extracted Features release represents a snapshot of 13.7 million volumes from HathiTrust, and HathiTrust+Bookworm likewise can visualize 13.7 million volumes from the Digital Library.
- Log into HTRC Analytics and click Worksets in the top menu.
- Click the Validate Workset button toward the top right of the screen. Select a public or private workset you would like to validate from the drop-down menu. Alternatively, if you would like to validate a list of volume IDs before creating a workset, upload a file to validate. As when you upload a file to create a workset, you can upload a CSV, TSV, or TXT file where the only required field is the list of volume IDs in the first column.
- Validating a workset will show you how many of the volumes in your workset are currently accessible via HTRC Algorithms or the HTRC Data Capsule environment. You can download either the volume IDs that are valid or those that are not. You can then upload the valid IDs as a new workset.
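Before uploading a file for validation, it can help to check locally that the file meets the format requirements described in this guide: a header row containing "volume" or "id", and a volume ID in the first column of every row. The sketch below is a hypothetical pre-check, not an HTRC tool. It assumes the header text may appear in any cell of the header row, and it cannot tell you whether a volume is actually available in HTRC; only the Validate Workset feature does that.

```python
import csv
import io


def check_volume_file(text, delimiter=","):
    """Return a list of format problems found in a volume ID file."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        return ["file is empty"]

    problems = []
    header_text = " ".join(rows[0]).lower()
    if "volume" not in header_text and "id" not in header_text:
        problems.append("header row does not contain 'volume' or 'id'")

    # Data rows start on line 2 of the file.
    for line_number, row in enumerate(rows[1:], start=2):
        if not row or not row[0].strip():
            problems.append(f"line {line_number}: no volume ID in the first column")
    return problems


# Placeholder content; the ID is made up for illustration.
print(check_volume_file("volume\nmdp.0001\n"))  # []
```

For a TSV file, pass `delimiter="\t"`. An empty result means no format problems were found by this check.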
Need more help?
If you are experiencing issues with worksets or any other aspect of HTRC Analytics or have other questions, please contact firstname.lastname@example.org for assistance.