This tutorial will show you how to use the HTRC Portal and Workset Builder. Click on each link to follow the step-by-step instructions.
To use the Portal and Workset Builder, you will need to sign up for an account on the Portal. From within the Portal, you can also access the Data Capsule.
- Password must be more than 15 characters long.
- Password must contain characters from three of the following five categories:
  - Uppercase characters of European languages (A through Z, with diacritic marks, Greek and Cyrillic characters)
  - Lowercase characters of European languages (a through z, sharp-s, with diacritic marks, Greek and Cyrillic characters)
  - Base 10 digits (0 through 9)
  - Nonalphanumeric characters
  - Any Unicode character that is categorized as an alphabetic character but is not uppercase or lowercase. This includes Unicode characters from Asian languages.
- Password must not contain any white spaces.
- Password must not contain your user ID.
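If you would like to check a candidate password before filling in the sign-up form, the rules above can be sketched in Python as follows. This is only an approximation and not the Portal's own validation code: the function name `password_problems` is hypothetical, Unicode general categories are used as a stand-in for the "European languages" wording, and the user ID check is assumed to ignore letter case.

```python
import unicodedata


def password_problems(password, user_id):
    """Return the rules a candidate password breaks (empty list = passes)."""
    problems = []
    if len(password) <= 15:  # "more than 15" means at least 16 characters
        problems.append("must be more than 15 characters long")
    if any(ch.isspace() for ch in password):
        problems.append("must not contain white space")
    # Assumption: the user ID check ignores letter case.
    if user_id and user_id.lower() in password.lower():
        problems.append("must not contain your user ID")

    kinds = [unicodedata.category(ch) for ch in password]
    categories_used = [
        "Lu" in kinds,                                # uppercase letters
        "Ll" in kinds,                                # lowercase letters
        any(ch in "0123456789" for ch in password),   # base 10 digits
        any(not ch.isalnum() and not ch.isspace() for ch in password),  # nonalphanumeric
        any(k in ("Lo", "Lm", "Lt") for k in kinds),  # other alphabetic characters
    ]
    if sum(categories_used) < 3:
        problems.append("must use at least three of the five character categories")
    return problems


print(password_problems("Correct-Horse-Battery9", "alice"))
print(password_problems("alllowercaseletters", "alice"))
```

An empty list means that none of the listed rules is broken. The Portal may still apply checks that this sketch does not cover.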
Troubleshooting: if you can't sign up successfully
You need an email address from an academic institution to sign up. We maintain a growing list of allowed email domains, e.g. emails ending with .edu or edu.ac or edu.tw. If you can't sign up successfully with your academic email, your email domain is probably not on the list. In this situation, please submit an account request by clicking on the "Request Account" button at the bottom of the Sign Up page.
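To see why an academic address can still be rejected, here is a minimal sketch of a domain allowlist check. It is an illustration only: the function name `domain_is_allowed` and the matching logic are assumptions, and `ALLOWED_SUFFIXES` contains just the three example endings mentioned above, not the real list maintained for the Portal.

```python
# Hypothetical allowlist: only the three example endings named in this tutorial.
ALLOWED_SUFFIXES = ("edu", "edu.ac", "edu.tw")


def domain_is_allowed(email, allowed_suffixes=ALLOWED_SUFFIXES):
    """Return True if the email's domain equals or ends with an allowed suffix."""
    _, separator, domain = email.strip().lower().rpartition("@")
    if not separator or not domain:
        return False  # not an email address at all
    return any(
        domain == suffix or domain.endswith("." + suffix)
        for suffix in allowed_suffixes
    )


print(domain_is_allowed("reader@library.example.edu"))
print(domain_is_allowed("reader@example.ac.jp"))
```

In this sketch the second address is rejected simply because its ending is not in the list, which is the situation the "Request Account" button is meant for.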
Browse public worksets from other users and your previously created worksets.
You will be taken to a page listing all the worksets that are public or were created by you.
You can click on the hyperlinks to see the volume information of each workset. You can also click on the "Edit" and "Download" buttons to edit the volume membership of a workset and to download its volume IDs, respectively.
Create a workset of interest to you by searching and selecting volumes in the Workset Builder.
Note: You will be led to the HTRC Workset Builder system, a system accessible through the portal, for building your workset. That also means you need to sign in again once arriving at the Workset Builder.
You will then see a pop-up window. Click on the "Go" button to go to the sign-in page for the Workset Builder.
On the sign-in page, enter the same username and password that you use for the Portal.
You will then be taken to the approval page. Click on the "Approve" button. Alternatively, you can click on the "Approve Always" button.
You can select which field to search for the terms you enter. The available fields are Full Text, Title, Author, and Subject. Then click on the "Search" button to obtain the search results. For example, you can query the search term "Shakespeare" in the Author field.
All the volumes with Shakespeare as an author are then listed on the results page.
The "Selected Items" on the top right of the page reflects the number of volumes you have selected. In this example, we select the first two volumes in the search result to put in a workset, and you can see that the text says "Selected Items (2)", indicating that two volumes have been selected.
That takes you to the Selected Items page, where your selected volumes are listed. You can add these volumes to a workset.
You have two options for where to add the selected volumes: create a new workset for them, or put them in a pre-existing workset. For either option, click on the "Create/Update Workset" button right above the list of volumes. This takes you to the page for creating a new workset or updating a pre-existing one.
The steps for the two options are described below.
Option 1: Create a new workset
After clicking the "Create/Update Workset" button on the Selected Items page, you are led to this page. Fill in the fields on the page for the workset creation.
For the "Availability" field, choose Public if you want to make the workset publicly accessible, or Private if you want only yourself to see it.
Option 2: Replace a pre-existing workset
Alternatively, instead of creating a new workset, you can replace the volumes of a pre-existing workset with the selected items.
After clicking the "Create/Update Workset" button on the Selected Items page, choose an existing workset listed under the "Update an existing workset" section on the page you are led to, and then click the "Update" button.
Warning: this action will erase all the volumes in the existing workset and replace them with the newly selected items.
You will be led to the Manage Workset page. In the drop-down list on that page, select the workset you want to manage, and then click on the "Open" button to open it. Here we are using "ontology_workset", a public workset.
The Workset page (i.e. the Selected Items page for this workset) lists the volumes of this workset. You can remove an unwanted volume by deselecting the checkbox next to it. In this example, the first two items are deselected.
After deselecting the unwanted volumes, click on the "Create/Update Workset" button to save the change.
On the next page, choose the workset you want to update. Since we are working with the "ontology_workset" workset here, we select the "ontology_workset" to update.
To do so, select "ontology_workset" in the drop-down list under the "Update an existing workset" section, and then click the "Update" button.
Workset Builder -> Portal
In the Workset Builder, click on the "Portal" button at the top right of the page to go to the Portal.
Portal -> Workset Builder
In Portal, you can go to the Workset Builder by either clicking on "Worksets -> Create Workset" on the top menu bar, or clicking on "Create Workset" on the bottom left of the Portal homepage.
Choose text analysis algorithms and obtain results in the portal
In the Portal, navigate to the text analysis page by clicking on "Algorithms" in the menu bar at the top of the page.
On the Algorithms page, select an algorithm from the list. You can read the descriptions to learn what each algorithm can do with your workset. In this example we choose the "Meandre_Topic_Modeling" algorithm.
Click on the "Meandre_Topic_Modeling" algorithm link, and fill in the required parameters on the next page. Some parameters have default values; others you will need to fill in yourself. You will also need to select the workset (i.e. the collection) that you want to work with. Click on the "Submit" button to submit the job.
After submitting, you will be led to the Results page for viewing job status.
You can stay on the Results page and refresh the page to see the most up-to-date status of the job. See the Examine Results page for more details of the Results page.
Descriptions of the text analysis algorithms available in the HTRC Portal are listed below. They are also available on the Algorithms page in the Portal.
|#||Name||Simple description||Technical description||Author||Version|
EF Rsync Script Generator
Generate a script that allows you to download extracted features data for your workset of choice. The script can be run locally, listing the Rsync commands to access the volumes of the workset.
Result of job: script to download extracted features data files
Generates a script to download the extracted features (EF) data for the specified workset using rsync. For more information on the extracted features data see https://analytics.hathitrust.org/features.
Note: Extracted features data was not created for a small number of volumes, so it is possible that not all of your workset volumes will be processed.
Download the bibliographic information for each volume in your workset of choice.
Result of job: zip (compressed) file that, when downloaded and expanded, contains the bibliographic metadata for each volume
|Takes a workset as input and outputs the MARC (Machine Readable Cataloging) record for each volume in the workset in MarcXML format.||Zong Peng||1.7|
|3||Meandre Dunning Log-likelihood to Tagcloud|
Compare and contrast two worksets by identifying the words that are more and less common in one workset, called the analysis workset, than in another workset, called the reference workset.
Result of job: tag cloud visualizations and lists of most and least commonly shared words in csv format
Note: The upper limit on the number of volumes is 1000.
|4||Meandre OpenNLP Date Entities To Simile|
Visualize the dates in a workset on a timeline. Each date (e.g. May 4, 1803) is displayed with its unique HathiTrust Digital Library volume identifier, the page on which it occurred, and a snippet of the sentence in which it occurred.
Result of job: timeline visualization
Information extraction is used to extract date entities that can be displayed on a timeline. This allows a researcher to review sentences that include dates via the timeline. We are using the OpenNLP system to automatically extract the entities from the text. The date entities, and the sentences in which they exist, are then displayed in Simile Timeline.
Note: The upper limit on the number of volumes is 100.
|5||Meandre OpenNLP Entities List|
Generate a list of all of the names of people and places, as well as dates, times, percentages, and monetary terms, found in a workset. You can choose which entities you would like to extract.
Result of job: table of the named entities found in a workset
Note: The volume limit is 100.
|6||Meandre Spellcheck Report Per Volume|
Find misspelled words that are the result of OCR errors in the text of a workset's volumes, with suggested replacements. Currently the replacements cannot be made within the Portal.
Result of job: lists of the misspellings in a workset, the number of times they occur, and suggested corrected spellings
The token counts data is used to determine if a suggested dictionary word occurs in the token counts data, and whether it should be used. There are options for customizing the transformation rules which indicate the types of OCR errors that should be corrected. For instance, a known problem is the transformation of an "li" to an "h" and vice versa, and this is expressed with the transformation rule "li=h", which says that, for all misspelled words with an "h", a check will be done to see if a conversion to "li" forms a correctly spelled word.
Create a tag cloud visualization of the most frequently occurring words in a workset, as well as a list of the most frequent words. In a tag cloud, the size of the word is displayed in proportion to the number of times it occurred.
Result of job: list of most frequent words and a tag cloud visualization of them
|8||Meandre Tagcloud with Cleaning|
Clean the text and then create a tag cloud visualization of the most frequently occurring words in a workset. In a tag cloud, the size of the word is displayed in proportion to the number of times it occurred.
Result of job: list of most frequent words in a workset's cleaned text and a tag cloud visualization of them
Performs token counts with some additional text cleaning and displays the most frequent tokens in a tag cloud.
Note: The upper limit on the number of volumes is 1000.
|9||Meandre Topic Modeling|
Identify "topics" in a workset based on words that have a high probability of occurring close together in the text. Topics are models trained on co-occurring text using Latent Dirichlet Allocation (LDA), where each topic is treated as a generative model and volumes are assigned a probability of how likely each topic is to have generated that text. The most likely words for a topic are displayed as a word cloud.
Result of job: xml file with topics, and visualizations of them in the form of tag clouds.
Performs topic modeling analysis in the style of Latent Dirichlet allocation (LDA) and its variants, notably the form used in Mallet.
|10||Simple Deployable Word Count|
Identify the words that occur most often in a workset and the number of times they occur.
Result of job: list of the most frequent words in a workset and the number of times they occur
|A simple word count Java client that uses the HTRC Data API to access the volumes in the specified workset, and displays the top N most frequently occurring words within that workset.||Yiming Sun||1.4|
- Naive-Bayes classification
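The "li=h" transformation rule described for the Meandre Spellcheck Report Per Volume algorithm can be illustrated with a short sketch. This is not the Meandre implementation: the function name `apply_rule` and the tiny dictionary are made up for the example, only one occurrence is replaced at a time, and the token counts check mentioned in the description is left out.

```python
def apply_rule(word, rule, dictionary):
    """Suggest corrections for a misspelled word using a rule such as "li=h".

    For "li=h", every "h" in the word is replaced in turn by "li", and the
    result is kept only if it is a correctly spelled (dictionary) word.
    """
    correct, ocr_error = rule.split("=", 1)
    suggestions = []
    start = word.find(ocr_error)
    while start != -1:
        candidate = word[:start] + correct + word[start + len(ocr_error):]
        if candidate in dictionary and candidate not in suggestions:
            suggestions.append(candidate)
        start = word.find(ocr_error, start + 1)
    return suggestions


# A made-up dictionary, for illustration only.
dictionary = {"light", "client", "the"}
print(apply_rule("hght", "li=h", dictionary))
print(apply_rule("tlie", "h=li", dictionary))
```

The second call writes the rule the other way round ("h=li") to cover the reverse error, where an "h" in the original text was read as "li".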
Examine text analysis job status and results
Right after you submit a job, depending on how fast it is computed, you may see it listed in the "Active Job" section, along with all the other active jobs you have submitted.
After the job has finished, it is listed in the "Completed Jobs" section. Click on the job name to see its result.
On the result page of this demo job, clicking on the "topic_tagcloud.html" link shows the tag cloud of the topics.
Create a workset by supplying a list of volume IDs
Downloading a workset
After you have created a workset using the Workset Builder, you can download it as a list of volume identifiers in comma separated value (csv) format. Because each workset is functionally a list of pointers to content in the HathiTrust Digital Library, the full text of the volumes is not included in the download. If you are interested in receiving a dataset from the HathiTrust to do research on your own machine, please refer to the directions for requesting a custom dataset. The volume identifiers in a workset are consistent with the volume identifiers used elsewhere across the HathiTrust.
From the homepage of the Portal, sign in and then navigate to either "My Worksets" or "View All Worksets".
Click "Download" next to the workset you would like to download to your computer.
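Once the file is on your computer, you may want to load the volume identifiers into a script. The sketch below shows one way to do that in Python. It assumes that the identifiers are in the first column and that the file may start with a "volume_id" header row; check your downloaded file, since its exact layout is not described here. The function name `read_volume_ids` and the identifiers in the sample are placeholders, not real volumes.

```python
import csv
import io


def read_volume_ids(csv_file):
    """Return the volume IDs found in the first column of an open CSV file."""
    ids = []
    for row in csv.reader(csv_file):
        if not row or not row[0].strip():
            continue  # skip blank lines
        value = row[0].strip()
        if value.lower() == "volume_id":
            continue  # skip the header row, if the file has one
        ids.append(value)
    return ids


# Placeholder identifiers standing in for the content of a downloaded file.
sample = io.StringIO("volume_id\nxyz.0000001\nxyz.0000002\n")
print(read_volume_ids(sample))
```

To read a real download, open the file instead of using the sample, for example with `open("my_workset.csv", newline="")`, where the file name is whatever you chose when saving.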
Uploading a workset
If you already have a list of identifiers for volumes you would like to analyze, including if you have downloaded the metadata for a collection in the HathiTrust Digital Library, you can create a workset based on that list by uploading the volume identifiers in a comma separated value (csv) file.
If you are trying to run the Naïve Bayes classification algorithm in the Portal, you will need to upload a workset with the volume identifiers in one column ("volume_id"), and the labels you would like to apply to each volume in the workset in another ("class"). Otherwise, if you aren't planning to use the classification algorithm, only one column is needed ("volume_id").
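If your list of identifiers lives in a script or spreadsheet export, a file in either layout can be produced with a few lines of Python. The following is a minimal sketch: the column names "volume_id" and "class" come from the description above, while the function name `write_workset_csv`, the placeholder identifiers, the labels, and the assumption that the column names go in a header row are illustrative only.

```python
import csv
import io


def write_workset_csv(out, volume_ids, labels=None):
    """Write volume IDs, and class labels if given, as CSV rows to `out`.

    `labels` is an optional dict that maps each volume ID to its class label.
    """
    writer = csv.writer(out, lineterminator="\n")
    if labels is None:
        writer.writerow(["volume_id"])
        for volume_id in volume_ids:
            writer.writerow([volume_id])
    else:
        writer.writerow(["volume_id", "class"])
        for volume_id in volume_ids:
            writer.writerow([volume_id, labels[volume_id]])


# Placeholder identifiers and labels, for illustration only.
volume_ids = ["xyz.0000001", "xyz.0000002"]
labels = {"xyz.0000001": "fiction", "xyz.0000002": "poetry"}

buffer = io.StringIO()
write_workset_csv(buffer, volume_ids, labels)
print(buffer.getvalue())
```

To create a file for upload, pass a file opened with `open("my_upload.csv", "w", newline="")` instead of the in-memory buffer, and leave out `labels` if you do not plan to use the classification algorithm.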
To upload, click on "Worksets" then "Upload Workset" on the top menu bar, or click on "Upload Workset" at the bottom of the homepage of the Portal.
A pop-up menu will open. Fill in the fields with the proper values, then click "Submit." If you click the box to make the workset private, it will only be available to you. Otherwise, the workset will be public and other users will be able to access it.
Use the HTRC Data Capsule for non-consumptive research on the HT corpus