This tutorial will show you how to use HTRC Analytics. Click on each link to follow the step-by-step instructions.
To use HTRC Analytics, you will need to sign up for an account. From within HTRC Analytics, you can access web-based algorithms, the Data Capsule administrative dashboard, and links to shared datasets.
Go to HTRC Analytics: https://analytics.hathitrust.org/ and on the top right of the webpage, click on the "Sign Up" button.
On the Sign Up page, enter the requested information, including the username and password you intend to use.
The password must meet these requirements:
- Password must be more than 15 characters long.
- Password must contain characters from three of the following five categories:
  - Uppercase characters of European languages (A through Z, with diacritic marks, Greek and Cyrillic characters)
  - Lowercase characters of European languages (a through z, sharp-s, with diacritic marks, Greek and Cyrillic characters)
  - Base 10 digits (0 through 9)
  - Any Unicode character that is categorized as an alphabetic character but is not uppercase or lowercase. This includes Unicode characters from Asian languages.
- Password must not contain any white spaces.
- Password must not contain your user ID.
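If you would like to check a candidate password before submitting the form, the rules above can be sketched in a few lines of Python. This is only an illustration, not the check HTRC Analytics itself runs: the function name check_password is hypothetical, it tests only the four character categories actually listed above, and it assumes the user ID comparison ignores case.

```python
def check_password(password: str, user_id: str) -> list[str]:
    """Return a list of rule violations; an empty list means no rule was broken."""
    problems = []

    # Rule: more than 15 characters.
    if len(password) <= 15:
        problems.append("must be more than 15 characters long")

    # Rule: characters from at least three categories.
    # Only the four categories listed in this tutorial are tested here.
    categories = [
        any(c.isupper() for c in password),        # uppercase letters
        any(c.islower() for c in password),        # lowercase letters
        any(c in "0123456789" for c in password),  # base 10 digits
        any(c.isalpha() and not c.isupper() and not c.islower()
            for c in password),                    # caseless alphabetic characters
    ]
    if sum(categories) < 3:
        problems.append("must contain characters from at least three categories")

    # Rule: no white space.
    if any(c.isspace() for c in password):
        problems.append("must not contain white space")

    # Rule: must not contain the user ID.
    # Assumption: the comparison is case-insensitive.
    if user_id and user_id.lower() in password.lower():
        problems.append("must not contain your user ID")

    return problems


if __name__ == "__main__":
    # Hypothetical password and user ID, for illustration only.
    for problem in check_password("short1A", "jdoe"):
        print("Password", problem)
```

An empty list means the password satisfies the rules as sketched here; the sign-up form remains the final authority.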
Troubleshooting: if you can't sign up
You need an academic email address to sign up. We maintain a growing list of allowed email domains (for example, addresses ending with .edu, edu.ac, or edu.tw). If you can't sign up with your academic email address, your email domain is probably not on the list. In that case, please submit an account request by clicking the "Request an account" button in the error message you see on the sign-up page.
Activate your account
You will receive an email at the address you registered. Go to your email inbox and follow the activation link in the message to activate your account.
Now that you have created an account on HTRC Analytics, you can sign in. From the top right of the page, click "Sign in" and then enter your username and password on the log-in page.
Browse public worksets
Sign in to HTRC Analytics and, from the home page, click "Worksets" in the top menu to browse the public worksets and your own worksets.
You will be taken to the page listing all the worksets that are public or created by you.
You can filter worksets by name, or you can narrow the display to your worksets only.
You can click on the hyperlinks to see the volumes in the workset and minimal metadata about each volume. You can follow the links in the “Title” field to see the volume in the HathiTrust Digital Library. You can also click the "Download" button to download the HathiTrust volume IDs in the workset.
Uploading a workset
You can create a workset to use with HTRC algorithms by uploading a list of HathiTrust volume IDs. Your volume ID list must be in CSV or TXT format, and the only thing it must contain is the volume IDs in the left-most column. Additional fields will be disregarded.
There are several ways to get a list of volume IDs, the easiest of which is to upload the metadata for a collection you create in the HathiTrust Digital Library. Collections in the HathiTrust Digital Library are created by searching the digital library, selecting volumes of interest, and then saving them.
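To preview which values would be picked up from a file you plan to upload, you can read its left-most column yourself. The sketch below is an illustration under stated assumptions: the function name read_volume_ids is hypothetical, the volume IDs in the example are made up, and whether your file has a header row is something you need to decide (the skip_header flag is there for that case).

```python
import csv
import io


def read_volume_ids(text: str, delimiter: str = ",", skip_header: bool = False) -> list[str]:
    """Return the values in the left-most column, ignoring all other fields."""
    ids = []
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        # Skip blank lines and rows whose first field is empty.
        if not row or not row[0].strip():
            continue
        # Optionally drop the first non-empty row (a header).
        if skip_header:
            skip_header = False
            continue
        ids.append(row[0].strip())
    return ids


if __name__ == "__main__":
    # Made-up IDs and titles, for illustration only.
    sample = "mdp.001,Title A\nuc1.b002,Title B\n"
    print(read_volume_ids(sample))
```

For a file on disk, read it first (for example with pathlib.Path("ids.csv").read_text(encoding="utf-8")) and pass the text to the function; use delimiter="\t" for tab-separated metadata.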
Here are the steps you will need to follow:
- Go to the HathiTrust Digital Library and begin your search. (See: Tips for searching HathiTrust.)
- On the results page, check the boxes next to the items you would like to add to your collection and, in turn, to your workset. Alternatively, choose "Select all on page." From the drop-down labeled "Select Collection," choose the collection to which you would like to add your selected volumes. You can also choose to create a new collection at this point! (See: How to create a collection.)
- Collections can be created temporarily as you browse, or you can log in to save your collection. Note that the credentials for the HathiTrust Digital Library are different from your HTRC Analytics account and are available only to users at HathiTrust partner institutions, although guest accounts can be created. (See: How to create a guest account.)
- Once you have a collection and you are ready to transform it into a workset, you will download the metadata for the collection by clicking a button on the left side of the page for your collection. (See: How to download collection information.)
- Return to HTRC Analytics and sign in.
- Click "Worksets" from the top menu on the home page.
- Click the orange "Create A Workset" button toward the top right of the Worksets page.
- Give your workset a name, and decide whether you would like it to be public (visible to every user who signs in to HTRC Analytics) or private (visible only to you). The default is public.
- Upload the metadata file you downloaded for your collection.
- Click "Create Workset."
Validate a workset
HathiTrust is a dynamic repository: It continues to grow, and, with less frequency, items are removed or their access profile changes. In order to check if the volumes in your workset are available for analysis using HTRC algorithms or the HTRC Data Capsule environment, you can validate a workset.
Note: HTRC Algorithms and HTRC Data Capsules can currently access a snapshot of public domain volumes from the HathiTrust Digital Library. The HTRC is making improvements to increase the frequency with which data is synced from HathiTrust. The most recent HTRC Extracted Features release represents a snapshot of 13.7 million volumes from HathiTrust, and HathiTrust+Bookworm likewise can visualize 13.7 million volumes from the Digital Library.
To validate a workset, start by clicking "Worksets" in the top menu.
Then, click the button toward the top right that says, "Validate Workset."
You will be able to choose the workset you would like to validate, either one of your own or a public workset.
You can also validate a list of HathiTrust volume IDs before you create a workset from them. As when you upload a file to create a workset, you can upload a CSV, TSV, or TXT file where the only required field is the list of volume IDs in the first column.
Validating a workset will show you how many of the volumes in your workset are currently accessible via HTRC algorithms or the HTRC Data Capsule environment. You can download either the volume IDs that are valid or those that are not. You could then upload the valid IDs as a new workset, if you wanted.
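If you have your original ID list and the downloaded list of valid IDs, you can also split the original list yourself and prepare the valid part for upload. The following is a minimal sketch: the function names split_by_validity and to_upload_text are hypothetical, the IDs are made up, and the one-ID-per-line output is simply one layout that keeps the IDs in the first column.

```python
def split_by_validity(workset_ids, valid_ids):
    """Split workset_ids into (valid, invalid), keeping the original order."""
    valid_set = set(valid_ids)
    valid = [v for v in workset_ids if v in valid_set]
    invalid = [v for v in workset_ids if v not in valid_set]
    return valid, invalid


def to_upload_text(ids):
    """Format IDs one per line, so they sit in the first column of a TXT file."""
    return "\n".join(ids) + "\n"


if __name__ == "__main__":
    # Made-up IDs, for illustration only.
    original = ["mdp.001", "uc1.b002", "inu.003"]
    downloaded_valid = ["inu.003", "mdp.001"]
    valid, invalid = split_by_validity(original, downloaded_valid)
    print(f"{len(valid)} of {len(original)} volumes are valid")
    print(to_upload_text(valid), end="")
```

Saving the output of to_upload_text to a .txt file gives you a list you could upload as a new workset.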
Run a text analysis algorithm
HTRC Analytics provides a number of text analysis algorithms for analyzing worksets, ranging from simple word counts to more sophisticated methods such as topic modeling. Typically, you choose an algorithm, select the workset you want to analyze, and set the appropriate parameters.
Make sure you are logged in to HTRC Analytics. Navigate to the algorithms by clicking "Algorithms" in the menu bar at the top of the page.
On the Algorithms page, select an algorithm from the list. You can read the description to learn about what the algorithms can do on your workset. For this example we chose the "Meandre_Topic_Modeling" algorithm. Click “Execute” to use this algorithm.
Fill in the required parameters. Some parameters have default values; others you will need to enter yourself. You will also need to select the workset (i.e. a collection) that you want to work with. The parameters entered for this demo are shown below. Click the "Submit" button to submit the job.
After submitting, you will be taken to the Jobs page for viewing job status. You can stay on the Jobs page and refresh the page to see the most up-to-date status of the job. Depending on how fast the job is computed, you will probably see the job listed in the "Active Jobs" section. You can also see all the active jobs submitted by you.
Examine results of algorithm
While a text analysis job is running and after it finishes, you can see your work on the "Jobs" page.
After the job is finished, you can see the job is listed in the "Completed Jobs" section. Click on the job name to see its result.
This is also the page you will return to when you want to see your results in the future. To get here again, click the Jobs button from the Algorithm page. (Note that you must be signed in to access these options.)
If you want to view past jobs, you can filter the results by name.
To view results, click on the job name.
On this job result page, you can use the tabs to see the outputs of the topic modeling algorithm. The algorithm presents the top words in the workset as a tag cloud (tagcloudcleantokencounts.html) and as a list (tagcouldcleantokencounts.csv.txt), and it provides logs (stdout.txt and stderror.txt).
Download a workset
After you have created a workset, you can download it as a list of volume identifiers in comma separated value (csv) format. Because each workset is functionally a list of pointers to content in the HathiTrust Digital Library, the full text of the volumes is not included in the download. If you are interested in receiving a dataset from the HathiTrust to do research on your own machine, please refer to the directions for requesting a custom dataset. The volume identifiers in a workset are consistent with the volume identifiers used elsewhere across the HathiTrust.
From the homepage of HTRC Analytics, sign in and then navigate to Worksets.
Click on the name of the workset you would like to download and click the "Download" button.
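Once you have the downloaded list, a quick summary can help you see what it contains. The sketch below counts volumes by the part of each identifier that precedes the first period. It assumes the identifiers take the form prefix.identifier, as in the made-up examples shown; the function name count_by_prefix is hypothetical.

```python
from collections import Counter


def count_by_prefix(volume_ids):
    """Count volume IDs by the text before the first period in each ID."""
    # IDs without a period are skipped rather than guessed at.
    return Counter(v.split(".", 1)[0] for v in volume_ids if "." in v)


if __name__ == "__main__":
    # Made-up IDs, for illustration only.
    ids = ["mdp.001", "mdp.002", "uc1.b003"]
    for prefix, count in count_by_prefix(ids).most_common():
        print(prefix, count)
```

Because only the part before the first period is used, any further periods inside an identifier do not affect the count.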