This tutorial will show you how to use the HTRC Portal and Workset Builder. Click on each link to follow the step-by-step instructions.

 

To use the Portal and Workset Builder, you will need to sign up for an account on the Portal. From within the Portal, you can also access the Data Capsule.

 Sign up for an account, and sign in



Sign Up

On the top right of the webpage, click the "Sign Up" button to open the Sign Up page.

On the Sign Up page, enter the requested information, including the username you intend to use and a password. The password must meet these requirements (a short illustrative script follows the list):
  • Password must be more than 15 characters long.
  • Password must contain characters from three of the following five categories:
    • Uppercase characters of European languages (A through Z, with diacritic marks, Greek and Cyrillic characters)
    • Lowercase characters of European languages (a through z, sharp-s, with diacritic marks, Greek and Cyrillic characters)
    • Base 10 digits (0 through 9)
    • Nonalphanumeric characters
    • Any Unicode character that is categorized as an alphabetic character but is not uppercase or lowercase. This includes Unicode characters from Asian languages.
  • Password must not contain any white spaces.
  • Password must not contain your user ID.
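For illustration only, here is a minimal Python sketch of how these rules might be checked. The function name is hypothetical and this is not the Portal's actual validator; user_id stands in for your chosen username.

    def meets_requirements(password: str, user_id: str) -> bool:
        """Hypothetical check of the password rules listed above; not HTRC's code."""
        if len(password) <= 15:                    # must be more than 15 characters
            return False
        if any(ch.isspace() for ch in password):   # must not contain white space
            return False
        if user_id.lower() in password.lower():    # must not contain the user ID
            return False
        categories = [
            any(ch.isupper() for ch in password),              # uppercase letters
            any(ch.islower() for ch in password),              # lowercase letters
            any(ch.isdigit() for ch in password),              # base 10 digits
            any(not ch.isalnum() for ch in password),          # nonalphanumeric characters
            any(ch.isalpha() and not ch.isupper() and not ch.islower()
                for ch in password),               # alphabetic but caseless (e.g., Asian scripts)
        ]
        return sum(categories) >= 3                # three of the five categories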

Troubleshooting: if you can't successfully sign up

You need an academic email address to sign up. We maintain a growing list of allowed email domains (e.g., addresses ending in .edu, edu.ac, or edu.tw), so if you can't successfully sign up with your academic email, your email domain is probably not yet on the list. In that case, please submit an account request by clicking the "Request Account" button at the bottom of the Sign Up page.

Sign In


You will receive an email at the address you registered. Follow the activation link in that email to activate your account.
Now that you have created an account on the Portal, you can sign in. On the top right of the Portal page, sign in with your username and password.

Browse public worksets from other users and your previously created worksets.

 Browse Workset

Sign in to the Portal. On the Portal home page, click "Browse Workset" at the bottom of the page, or follow "Worksets -> List" in the navigation bar at the top, to browse the public worksets and your own worksets.

You will be taken to a page listing all the worksets that are public or that you created.

You can click the hyperlinks to see the volume information for each workset, or use the "Edit" and "Download" buttons to edit a workset's volume membership or download its volume IDs, respectively.

Create a workset of interest to you by searching and selecting volumes in the Workset Builder. 

 Create Workset

This page describes how to create a workset in the HTRC Workset Builder.
 Navigate to the Workset Builder

The workset creation functionality allows users to create a workset containing the volumes they are interested in for further text analysis. On the Portal home page, click "Worksets -> Create Workset" on the top navigation bar, or click the "Create Workset" button on the bottom left of the page, to go to the workset creation page.

Note: You will be taken to the HTRC Workset Builder, a separate system accessible through the Portal, to build your workset. This means you will need to sign in again once you arrive at the Workset Builder.

You will then see a pop-up window; click the "Go" button. This takes you to the sign-in page for the Workset Builder, where you will need to sign in again using the same username and password as in the Portal.

 

 Sign In to the Workset Builder

When you arrive at the Workset Builder, you will be asked to sign in again with the same username and password as in the Portal. To sign in, click the "Login" button on the top right of the page.

On the next page, enter the same username and password that you use for the Portal.

 

You will then be taken to the approval page. Click the "Approve" button; alternatively, you can click the "Approve Always" button.

 

 Query the Volumes

Simple search: Once in the Workset Builder, enter your search terms in the Search box on the page.

You can select which field to search: Full Text, Title, Author, or Subject. Then click the "Search" button to obtain results. For example, below is a query for the search term "Shakespeare" in the Author field.

All the volumes with Shakespeare as the author will appear on the results page.

 

 Select Volumes from Search Results

A search usually returns many volumes; go through the results to select the ones of interest to you. Click the checkbox next to a volume title to mark that volume for inclusion in your workset.

The "Selected Items" counter at the top right of the page shows the number of volumes you have selected. In this example, we select the first two volumes in the search results; the text reads "Selected Items (2)", indicating that two volumes have been selected.

 

 Add Selected Volumes to a Workset

After selecting volumes, click on the "Selected Items" button on the top right of the page to put them in a workset. 

That takes you to the Selected Items page, which lists your selected volumes. You can add these volumes to a workset.

You have two options for where to add the selected volumes: create a new workset for them, or add them to a pre-existing workset. For either option, click the "Create/Update Workset" button just above the volume list. This takes you to the page for creating a new workset or updating an existing one.

The two options are described below.

Option 1: Create a new workset

After clicking the "Create/Update Workset" button on the Selected Items page, you are taken to a page where you fill in the fields to create the workset.

For the "Availability" field, choose Public if you want to make the workset publicly accessible, or Private if you want only yourself to see it.

Option 2: Replace a pre-existing workset

Alternatively, instead of creating a new workset, you can replace the volumes of a pre-existing workset with the selected items.

After clicking the "Create/Update Workset" button on the Selected Items page, choose an existing workset listed under the "Update an existing workset" section, and then click the "Update" button.

Warning: this action will erase all the volumes in the existing workset and replace them with the newly selected items.

 

 Manage Workset

You can manage your own worksets or public worksets using the "Manage Workset" functionality. In the Workset Builder, click "Manage Workset" on the top right part of the page.

You will be led to the Manage Workset page. In the drop-down list on that page, select the workset you want to manage, and then click the "Open" button to open it. Here we use "ontology_workset", a public workset.

The Workset page (i.e., the Selected Items page for this workset) lists the volumes in the workset. You can remove unwanted volumes by deselecting the checkboxes next to them; in the example below, the first two items are deselected.

 After deselecting the unwanted volumes, click on the "Create/Update Workset" button to save the change. 

On the next page, choose the workset you want to update. Since we are working with the "ontology_workset" workset here, we select "ontology_workset" in the drop-down list under the "Update an existing workset" section, and then click the "Update" button.

 Switch Between Workset Builder and Portal

Sometimes it's necessary to switch between the Workset Builder and Portal to perform your work. For example, you may want to create a workset in the Workset Builder and then run text mining algorithms on the workset in the Portal. You can switch between the Workset Builder and the Portal by following the hyperlinks on the pages.

Workset Builder -> Portal

In the Workset Builder, click the "Portal" button on the top right of the page to go to the Portal.

Portal -> Workset Builder

In the Portal, you can go to the Workset Builder by either clicking "Worksets -> Create Workset" on the top menu bar, or clicking "Create Workset" on the bottom left of the Portal homepage.

 


Choose text analysis algorithms and obtain results in the Portal

 Run Text Analysis
 Select an Algorithm and Submit a Job

The HTRC Portal provides a number of text analysis algorithms for analyzing your worksets, ranging from simple word counts to sophisticated techniques such as topic modeling. Typically, you choose an algorithm of interest, select the workset you want to analyze, and set the appropriate parameters.

In the Portal, navigate to the text analysis page by clicking "Algorithms" on the menu bar at the top of the page.

On the Algorithms page, select an algorithm from the list. You can read the descriptions to learn what each algorithm does with your workset. In this example, we choose the "Meandre_Topic_Modeling" algorithm.

Click the "Meandre_Topic_Modeling" algorithm link, and on the next page fill in the needed parameters. Some parameters have default values; others you must fill in yourself. You will also need to select the workset (i.e., the collection) you want to work with. Below are the parameters entered for this demo. Click the "Submit" button to submit the job.

After submitting, you will be led to the Results page for viewing job status.

You can stay on the Results page and refresh it to see the most up-to-date status of the job. See the Examine Results section below for more details on the Results page.

 

 Algorithms Description

 

Descriptions of the text analysis algorithms available in the HTRC Portal are listed below. They are also available on the Algorithms page in the Portal.

Thanks to Sayan Bhattacharyya, Loretta Auvil, Harriett Green, Thomas Padilla, Erica Parker, and others for their input on the descriptions. 

 

Each algorithm entry below gives the algorithm's name, a simple description, a technical description, and its author and version.
1. EF Rsync Script Generator

Generate a script that allows you to download extracted features data for your workset of choice. The script, which lists the rsync commands for accessing the volumes of the workset, can be run locally.

Result of job: script to download extracted features data files

Generates a script to download the extracted features (EF) data for the specified workset using rsync. For more information on the extracted features data see https://analytics.hathitrust.org/features.

Note: Extracted features data was not created for a small number of volumes, so it is possible that not all of your workset volumes will be processed.

Author: Colleen Fallaw. Version: 2.0
2. MARC Downloader

Download the bibliographic information for each volume in your workset of choice.

Result of job: zip (compressed) file that, when downloaded and expanded, contains the bibliographic metadata for each volume

Takes a workset as input and outputs the MARC (Machine Readable Cataloging) record for each volume in the workset in MarcXML format.

Author: Zong Peng. Version: 1.5
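As an illustration of working with the downloaded records, here is a Python sketch that prints the title (MARC field 245) from a MarcXML file using only the standard library; the file name is a placeholder.

    import xml.etree.ElementTree as ET

    MARC_NS = "{http://www.loc.gov/MARC21/slim}"     # standard MarcXML namespace
    tree = ET.parse("volume_record.xml")             # placeholder file name
    for field in tree.getroot().iter(MARC_NS + "datafield"):
        if field.get("tag") == "245":                # 245 = title statement
            title = " ".join(sf.text for sf in field if sf.text)
            print(title)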
3. Meandre Classification Naive Bayes

Classify the volumes in a workset into categories of your choosing. Naïve Bayes classification is based on Bayes' Theorem from statistics, and uses machine learning to estimate the correct classification for a volume based on information present in the volumes of each particular class. You will need to upload a custom csv file with classified volumes in order to use this algorithm. Note that you currently cannot use this algorithm to classify an unlabeled workset, so it does not yet work as fully as expected.

Result of job: text files of the results of the classification and the confusion matrix for the model

Performs Naïve Bayes classification on a workset (uploaded as a csv file) that contains the workset’s volume identifiers as a "volume_id" attribute and labeled data as "class" attribute.

The algorithm:

  • loads each page of each volume in the workset;
  • removes the first and last line of each page;
  • joins hyphenated words (if any) occurring at the end of each remaining line;
  • removes all tokens that do not consist of alphanumeric characters;
  • performs part-of-speech tagging (selecting nouns, verbs, adjectives and adverbs);
  • lowercases all tokens and counts the tokens remaining for each volume, eliminating all tokens with a count of 1;
  • notes the number of attributes to use (specified by the num_attributes parameter), and removes any attributes that exist in only one volume as well as any that exist in all volumes;
  • splits the data randomly into a training set (60% of the data) and a test set (40% of the data);
  • performs uniform discretization of the tokens' frequency counts, creating 2 equally spaced bins between the minimum and maximum of each scalar column;
  • creates a Naïve Bayes model on the training set, calculates its accuracy on the training and test sets, and outputs the confusion matrix; the predicted values for the volumes in the training and test sets are also output.


Note: Can only take up to 1000 volumes.
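As a rough local analogue of the final modeling steps above (60/40 split, Naïve Bayes model, confusion matrix), here is a scikit-learn sketch; it is not the Meandre implementation, and the texts and labels are placeholders.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.metrics import confusion_matrix

    texts = ["tokens of volume one", "tokens of volume two",
             "tokens of volume three", "tokens of volume four"]
    labels = ["fiction", "nonfiction", "fiction", "nonfiction"]

    X = CountVectorizer().fit_transform(texts)            # token frequency counts
    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=0.4, random_state=0)         # 60% train / 40% test
    model = MultinomialNB().fit(X_train, y_train)
    print("test accuracy:", model.score(X_test, y_test))
    print(confusion_matrix(y_test, model.predict(X_test)))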

Author: Loretta Auvil. Version: 1.0
4. Meandre Dunning Log-likelihood to Tagcloud

Compare and contrast two worksets by identifying the words that are more and less common in one workset, called the analysis workset, than in another workset, called the reference workset.

Result of job: tag cloud visualizations and lists of most and least commonly shared words in csv format

 

This algorithm:

  • calculates Dunning Log-likelihood based on two worksets provided as inputs: an “analysis workset” and a “reference workset”
  • this major functionality was developed as part of the Monk Project
  • loads each page of each volume in both worksets, removes the first and last line of each page, and joins hyphenated words that occur at the end of a line;
  • performs part of speech tagging (selecting only NN|NNS|JJ.*|RB.*|PRP.*|RP|VB.*|IN);
  • lowercases the tokens remaining;
  • counts the tokens remaining for all volumes for each collection;
  • compares the counts from the two collections using the Dunning log-likelihood statistic; the tokens "overused" in the analysis collection relative to the reference collection (200 tokens by default) are displayed as a tag cloud and made available via a csv file, and the "underused" tokens (also 200 by default) are likewise displayed as a tag cloud and made available via a csv file.

Note: The upper limit on the number of volumes is 1000.
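For a concrete sense of the statistic, here is a small Python sketch of one common form of the Dunning log-likelihood (G2) score for a single token; it is illustrative only, not HTRC's implementation, and the counts in the usage line are made up.

    from math import log

    def dunning_g2(count_a, total_a, count_b, total_b):
        """One common form of Dunning's G2 for a single token.

        count_a/total_a: token count and total token count in the analysis workset;
        count_b/total_b: the same for the reference workset.
        """
        e_a = total_a * (count_a + count_b) / (total_a + total_b)  # expected, analysis
        e_b = total_b * (count_a + count_b) / (total_a + total_b)  # expected, reference
        g2 = 0.0
        if count_a:
            g2 += count_a * log(count_a / e_a)
        if count_b:
            g2 += count_b * log(count_b / e_b)
        return 2 * g2

    # e.g., 150 occurrences in 100,000 analysis tokens vs. 25 in 200,000 reference tokens
    print(dunning_g2(150, 100_000, 25, 200_000))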

Author: Loretta Auvil. Version: 1.1
5. Meandre OpenNLP Date Entities To Simile

Visualize the dates in a workset on a timeline. Each date (e.g., May 4, 1803) is displayed with its unique HathiTrust Digital Library volume identifier, the page on which it occurred, and a snippet of the sentence in which it occurred.

Result of job: timeline visualization

Information extraction is used to extract date entities that can be displayed on a timeline. This allows a researcher to review sentences that include dates via the timeline. We are using the OpenNLP system to automatically extract the entities from the text. The date entities, and the sentences in which they exist, are then displayed in Simile Timeline.

The algorithm:

  • loads each page of each volume;
  • removes the first and last line of each page;
  • joins hyphenated words that occur at the end of the line;
  • for cleaning purposes, puts spaces around each of the following characters ", . ( ) [ ]";
  • extracts date entity types from the text;
  • displays each entity with the volume_id, page_id, and sentence snippet in Simile Timeline.

Note: The upper limit on the number of volumes is 100.

Author: Loretta Auvil. Version: 1.1
6. Meandre OpenNLP Entities List

Generate a list of all of the names of people and places, as well as dates, times, percentages, and monetary terms, found in a workset. You can choose which entities you would like to extract.

Result of job: table of the named entities found in a workset

This algorithm:

  • extracts named entities and provides this information in a table; (we are using the OpenNLP system to extract the entities from the text in an automated fashion);
  • loads each page of each volume from HTRC;
  • removes the first and last line of each page;
  • joins hyphenated words that occur at the end of the line;
  • extracts entity types specified from the text;
  • displays each entity with the volume_id, page_id, sentence_id and character position within the sentence.

Note: The volume limit is 100.

Author: Loretta Auvil. Version: 1.1

7. Meandre Spellcheck Report Per Volume

Find misspelled words that are the result of OCR errors in the text of a workset's volumes, with suggested replacements. Currently the replacements cannot be made within the Portal.

Result of job: lists of the misspellings in a workset, the number of times they occur, and suggested corrected spellings

The algorithm:

  • loads each page of each volume in the workset supplied to the algorithm as a parameter;
  • performs lowercase transformation of text;
  • provides several spelling statistics at a volume level for the html report; for the text file reports, information for each volume is displayed with a blank line separating the volumes; there are options to customize the dictionary, token counts, and transformation rules.

The token counts data is used to determine if a suggested dictionary word occurs in the token counts data, and whether it should be used. There are options for customizing the transformation rules which indicate the types of OCR errors that should be corrected. For instance, a known problem is the transformation of an "li" to an "h" and vice versa, and this is expressed with the transformation rule "li=h", which says that, for all misspelled words with an "h", a check will be done to see if a conversion to "li" forms a correctly spelled word.
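To make the rule mechanics concrete, here is a small Python sketch of applying a single transformation rule such as "li=h"; the tiny dictionary and sample words are placeholders, and this is not the algorithm's actual code.

    dictionary = {"like", "light", "shell"}          # placeholder dictionary

    def candidates(word, rule="li=h"):
        """Yield dictionary words formed by applying one OCR-correction rule."""
        target, source = rule.split("=")             # "li=h": replace "h" with "li"
        for i in range(len(word) - len(source) + 1):
            if word[i:i + len(source)] == source:
                fixed = word[:i] + target + word[i + len(source):]
                if fixed in dictionary:
                    yield fixed

    print(list(candidates("hke")))                   # -> ['like']
    print(list(candidates("hght")))                  # -> ['light']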

Note: The upper limit on the number of volumes is 100.

Author: Loretta Auvil. Version: 1.1
8. Meandre Tagcloud

Create a tag cloud visualization of the most frequently occurring words in a workset, as well as a list of the most frequent words. In a tag cloud, each word is displayed at a size proportional to the number of times it occurred.

Result of job: list of most frequent words and a tag cloud visualization of them

This algorithm:

  • counts the tokens for all volumes and displays the top 200 tokens in a tag cloud.

Notes:

  • No cleaning of the text is performed.
  • The upper limit on the number of volumes is 1000.

Author: Loretta Auvil. Version: 1.1
9. Meandre Tagcloud with Cleaning

Cleans the text before creating a tag cloud visualization of the most frequently occurring words in a workset. In a tag cloud, each word is displayed at a size proportional to the number of times it occurred.

Result of job: list of most frequent words in a workset's cleaned text and a tag cloud visualization of them

Performs token counts with some additional text cleaning and displays the most frequent tokens in a tag cloud.

This algorithm:

  • loads each page of each volume from HTRC;
  • removes the first and last line of each page;
  • joins hyphenated words that occur at the end of the line;
  • performs lowercase transformation of text;
  • removes all tokens that don't consist of alphanumeric characters;
  • uses the replacement rules (learned from our usage of Google Ngrams data) to clean OCR errors, normalize to British spelling and normalize for period spelling;
  • filters stop words;
  • counts the tokens remaining for all volumes and displays the top 200 tokens in a tag cloud.

Note: The upper limit on the number of volumes is 1000.

Author: Loretta Auvil. Version: 1.1
10. Meandre Topic Modeling

Identify "topics" in a workset based on words that have a high probability of occurring close together in the text. Topics are models trained on co-occurring text using Latent Dirichlet Allocation (LDA), where each topic is treated as a generative model and volumes are assigned a probability of how likely each topic is to have generated that text. The most likely words for a topic are displayed as a word cloud.

Result of job: xml file with topics, and visualizations of them in the form of tag clouds.


Performs topic modeling analysis in the style of Latent Dirichlet allocation (LDA) and its variants, notably the form used in Mallet.

This algorithm:

  • loads each page of each volume from HTRC;
  • removes the first and last line of each page;
  • joins hyphenated words that occur at the end of the line;
  • removes all tokens that do not consist of alphanumeric characters;
  • filters stop words;
  • replaces "not " with "not_" to deal with negations;
  • creates a topic model using Mallet;
  • displays the top 200 tokens in a tag cloud. 


Note: The upper limit on the number of volumes is 1000.
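For a local feel for what the topic modeling step produces, here is a scikit-learn LDA sketch; note that the HTRC algorithm uses Mallet instead, and the toy documents below are placeholders.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = ["whale sea ship voyage", "love sonnet verse rhyme",
            "ship ocean sail whale", "poem rhyme meter verse"]
    vec = CountVectorizer()
    X = vec.fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
    words = vec.get_feature_names_out()
    for k, topic in enumerate(lda.components_):
        top = [words[i] for i in topic.argsort()[::-1][:5]]  # 5 most likely words
        print(f"topic {k}:", " ".join(top))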

Author: Loretta Auvil. Version: 1.1
11. Simple Deployable Word Count

Identify the words that occur most often in a workset and the number of times they occur.

Result of job: list of the most frequent words in a workset and the number of times they occur

A simple word count Java client that uses the HTRC Data API to access the volumes in the specified workset and displays the top N most frequently occurring words within that workset.

Author: Yiming Sun. Version: 1.4
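The same idea can be sketched locally in a few lines of Python; the input file, the crude tokenization, and N = 10 are placeholder assumptions, not the client's actual behavior.

    from collections import Counter
    import re

    text = open("volume.txt", encoding="utf-8").read()    # placeholder input file
    tokens = re.findall(r"[a-z0-9]+", text.lower())       # crude alphanumeric tokenization
    for word, count in Counter(tokens).most_common(10):   # top N = 10 most frequent words
        print(word, count)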

Examine text analysis job status and results

 Examine Results

While a text analysis job is running, and after it completes, you can view its status and results on the "Results" page.

Right after submitting a job, depending on how quickly it runs, you can see it listed in the "Active Job" section, along with all the other active jobs you have submitted.

Once the job finishes, it is listed in the "Completed Jobs" section. Click the job name to see its results.

On this demo job's result page, we click the "topic_tagcloud.html" link to see the tag cloud of the topics.

Create a workset by supplying a list of volume IDs

 Upload Workset
 Upload Workset using CSV

Downloading a workset

After you have created a workset using the Workset Builder, you can download it as a list of volume identifiers in comma separated value (csv) format. Because each workset is functionally a list of pointers to content in the HathiTrust Digital Library, the full text of the volumes is not included in the download. If you are interested in receiving a dataset from the HathiTrust to do research on your own machine, please refer to the directions for requesting a custom dataset. The volume identifiers in a workset are consistent with the volume identifiers used elsewhere across the HathiTrust.

 

From the homepage of the Portal, sign in and then navigate to either "My Worksets" or "View All Worksets".

 

Click "Download" next to the workset you would like to download to your computer.
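Once downloaded, the csv is simply a list of volume identifiers. As an illustration, here is a minimal Python sketch of reading it; the file name and the "volume_id" header are assumptions based on the upload format described below.

    import csv

    # Read the volume identifiers out of a downloaded workset csv.
    # "my_workset.csv" and the "volume_id" header are assumptions.
    with open("my_workset.csv", newline="", encoding="utf-8") as f:
        volume_ids = [row["volume_id"] for row in csv.DictReader(f)]
    print(len(volume_ids), "volumes in this workset")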

 

 

Uploading a workset

If you already have a list of identifiers for volumes you would like to analyze, including if you have downloaded the metadata for a collection in the HathiTrust Digital Library, you can create a workset based on that list by uploading the volume identifiers in a comma separated value (csv) file. 

If you are trying to run the Naïve Bayes classification algorithm in the Portal, you will need to upload a workset with the volume identifiers in one column ("volume_id"), and the labels you would like to apply to each volume in the workset in another ("class").  Otherwise, if you aren't planning to use the classification algorithm, only one column is needed ("volume_id"). 
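As an illustration, here is a minimal Python sketch of building such an upload file; the volume identifiers and class labels are placeholders.

    import csv

    # "volume_id" is always required; include "class" only if you plan to run
    # the Naive Bayes classification algorithm. The IDs here are placeholders.
    rows = [
        {"volume_id": "example.001", "class": "poetry"},
        {"volume_id": "example.002", "class": "prose"},
    ]
    with open("labeled_workset.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["volume_id", "class"])
        writer.writeheader()
        writer.writerows(rows)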

To upload, click on "Worksets" then "Upload Workset" on the top menu bar, or click on "Upload Workset" at the bottom of the homepage of the Portal.

 

A pop-up menu will open. Fill in the fields with the proper values, then click "Submit." If you click the box to make the workset private, it will only be available to you. Otherwise, the workset will be public and other users will be able to access it. 

 

 A Special Case: Upload Labeled Workset Using CSV

 

To upload a labeled workset for the Naïve Bayes classification algorithm, follow the same upload steps as above, supplying a csv file with a "volume_id" column and a "class" column as described earlier on this page.

 

Use the HTRC Data Capsule for non-consumptive research on the HathiTrust corpus

Data Capsule Documentation