
These are Frequently Asked Questions (FAQ) about using the Portal and Workset Builder. See here for the general HTRC FAQ.  


Q: How do I get to the Portal and Workset Builder?

A: The Portal homepage is available at https://analytics.hathitrust.org/. Access the Workset Builder through the Portal after you have logged in. The Portal is also the gateway to the HTRC Data Capsule, and it has links to the primary HTRC-provided datasets as well.

Q: What are the Portal and Workset Builder?

A:  The Portal and Workset Builder are complementary tools for assembling subcollections of volumes from the HathiTrust Digital Library and studying them using off-the-shelf tools for text analysis. The Portal is the entry point for a number of HTRC services, including the Workset Builder, suite of algorithms, Extracted Features Dataset, and the HTRC Data Capsule. The Workset Builder is a search tool for finding content in the HathiTrust Research Center relevant to one’s research questions and compiling it into a subcollection, called a workset, for analysis. 

Q: How do I create an account to log in to the Portal?

A:  You can create an account by going to the Portal homepage and clicking “Sign Up” in the top right corner.  Anyone possessing an email address from a nonprofit institution of higher education is allowed to register, including those whose institutions are not HathiTrust members. 

Q: What are worksets and what do I do with them?

A:  Worksets are subcollections of HathiTrust content created by researchers. Create a workset by using the Workset Builder to search the metadata and full-text OCR of the public domain materials in the HathiTrust Digital Library and compiling the volumes of interest to you. You can run HTRC algorithms against worksets in order to analyze them or download their Extracted Features. Worksets can be cited, and researchers can choose to make their worksets public or private.

Q: What terms do I need to know to get started using the Portal and Workset Builder?

A: In addition to worksets, the HTRC Portal has three overarching concepts: algorithms, jobs, and results.

  1. Algorithms are research methodologies expressed in executable code; that is, they are programs that run one or more functions against your workset. You can choose from a set of algorithms that have been integrated into the HTRC, and you can customize the parameters for each algorithm.
  2. Jobs: When you hit submit, you are submitting a job. A job is a set of instructions that are executed by one of the computing resources available to the HTRC. You can view the status of the jobs that you have submitted. You can also delete a job, for example if you find that you made an error in your setup.
  3. Results: When your job has completed, you can view its results in the HTRC. You can also download the results.

Q: What kinds of text analysis services and/or algorithms are in the Portal?

A:  The Portal includes access to 11 off-the-shelf HTRC algorithms that facilitate text analysis. These algorithms help you extract, refine, analyze, and visualize the content of a workset. Please see this description of the algorithms provided via the HTRC Portal for more information.

Q: Is the Workset Builder the only way to create a workset?

A: No. The Workset Builder is primarily a tool for searching and compiling volumes in order to create a workset. If you already know the IDs of the volumes you would like to include in a workset, or if you are interested in using an algorithm that requires a labeled workset, such as the Meandre Naïve Bayes classification algorithm, you can upload a workset in the form of a CSV file. Basic (unlabeled) worksets can be created with the Workset Builder or with the upload CSV functionality, while labeled worksets can only be added by uploading a CSV.

Building labeled worksets

Labeled worksets include both volume IDs and classification terms that allow them to be used with classification algorithms, such as Naïve Bayes. There are multiple ways to build a labeled workset, but one common approach is to:

  1. Build a basic workset in the Workset Builder
  2. Download the basic workset
  3. Open the workset in the HTRC CSV Editor prototype (or editor of your choice)
  4. In the CSV Editor or spreadsheet you can:
    1. add terms to the "class" column
    2. append multiple CSVs
    3. add volume IDs manually
  5. Save the CSV file and upload it to the HTRC Portal

A labeled workset CSV should begin with a header line that names the columns. One column should hold the volume ID, and the other should hold the label you've assigned to the volume. For example:

volume_id                     class
mdp.39015001796500            Austen
uiuo.ark:/13960/t6n013296     Dickens
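If you prefer to generate the file with a script rather than a spreadsheet, the following is a minimal sketch in Python. It uses the two sample volumes shown above and assumes a comma-delimited file with the header names volume_id and class; the function name is hypothetical, and you should confirm the exact layout the Portal accepts before uploading.

```python
import csv
import io

# Sample labels keyed by volume ID (the two example rows from this FAQ).
LABELS = {
    "mdp.39015001796500": "Austen",
    "uiuo.ark:/13960/t6n013296": "Dickens",
}


def labeled_workset_csv(labels):
    """Return CSV text: a header line, then one row per volume ID and label.

    Assumption: the upload expects comma-separated columns named
    'volume_id' and 'class', as in the example above.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["volume_id", "class"])
    for volume_id, label in labels.items():
        writer.writerow([volume_id, label])
    return buffer.getvalue()


csv_text = labeled_workset_csv(LABELS)
```

To save the result for upload, write csv_text to a file with a .csv extension.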

When assigning classes to volumes, one way to check the title and content of a book is to open it in the HathiTrust page viewer. Take the URL http://babel.hathitrust.org/cgi/pt?id=mdp.39015033434559;view=1up;seq=1 and substitute the desired volume ID for the one in the URL.
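The substitution can also be scripted when you have many volumes to check. This is a sketch only: the URL pattern is the one given above, while the function name and the choice to percent-encode unusual characters (leaving ":" and "/" as they are) are assumptions.

```python
from urllib.parse import quote

# Base address taken from the example URL in this FAQ.
PAGE_VIEWER = "http://babel.hathitrust.org/cgi/pt"


def page_viewer_url(volume_id):
    """Build a page viewer URL for one volume by substituting its ID.

    Assumption: ':' and '/' (as found in ark-style IDs) can stay unescaped;
    any other reserved characters are percent-encoded.
    """
    return f"{PAGE_VIEWER}?id={quote(volume_id, safe=':/')};view=1up;seq=1"


example_url = page_viewer_url("mdp.39015001796500")
```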

Uploading Worksets

Worksets are uploaded in the HTRC Portal. Under Worksets in the menu bar, choose Upload Workset. Alternatively, when viewing the list of worksets in the Portal, click the "+" button to upload your own.

Note: The Portal and the CSV file display the volumes of a workset in different orders. (We are working on a fix for this issue.) Do not assume that the two orders match. If you assign classes to the volumes in your CSV file based on the order displayed in the Portal, the labels may end up attached to the wrong volumes. Match labels to volumes by volume ID instead.
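One way to avoid order-related mistakes is to attach labels by looking up each volume ID rather than relying on row position. The sketch below assumes the downloaded workset is a CSV with a volume_id header column; the sample data, the labels, and the function name are hypothetical.

```python
import csv
import io

# Hypothetical downloaded workset; the row order differs from the Portal display.
downloaded = "volume_id\nuiuo.ark:/13960/t6n013296\nmdp.39015001796500\n"

# Labels keyed by volume ID, so row order does not matter.
labels_by_id = {
    "mdp.39015001796500": "Austen",
    "uiuo.ark:/13960/t6n013296": "Dickens",
}


def label_rows(csv_text, labels):
    """Pair each volume ID in the CSV with its label, looked up by ID.

    Raises KeyError if a volume has no label, so gaps are caught early.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["volume_id"], labels[row["volume_id"]]) for row in reader]


rows = label_rows(downloaded, labels_by_id)
```

Because each label is found by its ID, the result is the same whatever order the rows arrive in.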

Q: What is the login timeout?

A: The current login timeout is 12 hours. The timeout does not affect jobs you have already submitted: a job will still run even if you log out or the system logs you out.

