HTRC User Getting Started FAQ

Introduction

...

Q1: What is the HTRC?

A: The HTRC is the research arm of the HathiTrust. It facilitates scholarly research using the large-scale HathiTrust Digital Library by providing mechanisms for researchers to access content in the HathiTrust and study it using computational tools for text analysis.

The HTRC is a partnership between Indiana University (IU) Libraries, the Pervasive Technology Institute, and the School of Informatics and Computing at IU, as well as the University of Illinois at Urbana-Champaign (UIUC) Libraries and the Graduate School of Library and Information Science at UIUC.

...

Q2: What are the HTRC tools and services?

...

A: The HTRC has created a couple of platforms for you to experiment with. The main HTRC services (sometimes referred to as the production stack) give you a Portal and a Workset Builder.

From the Portal you can log in and run analytic algorithms on a set of predefined collections of volumes. These algorithms, powered by the SEASR toolkit, run against the HathiTrust volumes that are in the public domain (close to 3M).

The Workset Builder is a search interface for the HathiTrust public domain corpus. Search results can be saved as a 'workset': a collection of volumes against which the text mining algorithms are run.

In addition to the main services, we also provide a Sandbox stack with the same tools. The sandbox runs against non-Google scanned content (about 260,000 volumes). The advantage of the sandbox is that you can access the index and Data API directly, and so you can write your own algorithms.

The HTRC also offers a suite of tools that allow researchers to perform text analysis on content in the HathiTrust Digital Library. Most of these tools are available via HTRC Analytics, and include web-based text analysis algorithms, HathiTrust+Bookworm, and the HTRC Data Capsule. They are intended to meet the needs of various HTRC researchers.

  • HTRC Algorithms: a set of tools for assembling collections of digitized text and performing text analysis on them.

  • HathiTrust+Bookworm: a tool for visualizing and analyzing word usage trends in the HathiTrust Digital Library.

  • HTRC Data Capsule: a secure computing environment for performing researcher-driven text analysis on HathiTrust content. 

Q3: How do I use the HTRC?

A: The HTRC has several overarching paradigms: worksets, algorithms, jobs, and results.

  1. Worksets are collections of volumes and other data to be processed. Worksets are built using software that functions like many library catalog systems. In the Workset Builder application (often referred to as Blacklight), you will be able to search for, view, and select items that you would like to process.
  2. Algorithms are research methodologies expressed in executable code; that is, they are programs that will run one or more functions against your workset. You can choose from a set of algorithms that have been integrated into the HTRC. You can customize the parameters for each algorithm.
  3. Jobs: When you hit submit, you are submitting a job. A job is a set of instructions that are executed by one of the computing resources available to the HTRC. You can view the status of the jobs that you have submitted. You can also delete jobs, for example if you find that you have made an error in your setup.
  4. Results: When your job has completed, you can view the results of the job. The results can be viewed in the HTRC. You can also download the results.

Q: What types of data and metadata does HTRC provide?

A: HTRC currently has the public domain corpus OCR text from HathiTrust, along with MARC and METS XML. 

Access and Services

Q: How do I obtain an account to access the HTRC Production Portal?

A: You may sign up for an account as described under Q4 below. You use the HTRC by interacting with our tools and services. Please refer to the documentation for each tool or service for more specific how-to guides.

Q4: Do I need an account to use the HTRC and how do I make one?

A: Most of the HTRC services require an account to log in and interact with the tools, though HathiTrust+Bookworm is available without an account.

Register for an account by going to the main page of HTRC Analytics (or the HTRC Production Portal, http://htrc2.pti.indiana.edu) and choosing "Sign up" from the menu. Anyone possessing an email address from a nonprofit institution of higher education is allowed to register, including those whose institutions are not HathiTrust members.

...

Q5: What is the difference between using the HTRC and searching the HathiTrust Digital Library?

...

A: This table lists the HTRC Production Stack entries:

Service | Endpoint | Comments
Portal | http://htrc2.pti.indiana.edu | The portal allows you to browse volume lists and algorithms, execute algorithms, and view results.
Blacklight | http://sandbox.htrc.illinois.edu:8080/blacklight | The Blacklight search interface allows you to search for volumes, and create volume lists that can be used by algorithms. It provides a GUI interface to our Solr index.

Q: How do I obtain an account to access the HTRC sandbox?

A: To request an account, send an email to htrc-tech-help-l@list.indiana.edu (a list read by HTRC internal staff only). Include your name and contact information, and indicate that you would like to access the HTRC Sandbox.

Q: How do I access HTRC Sandbox?

A: This table lists the HTRC Sandbox entries:

Service | Endpoint | Comments
Portal | https://sandbox.htrc.illinois.edu/HTRC-UI-Portal2 | The portal allows you to browse volume lists and algorithms, execute algorithms, and view results.
Blacklight | https://sandbox.htrc.illinois.edu/blacklight | The Blacklight search interface allows you to search for volumes, and create volume lists that can be used by algorithms. It provides a GUI interface to our Solr index.
Data API | https://sandbox.htrc.illinois.edu/data-api | The HTRC Data API provides access to the corpus data and METS XML via a RESTful web service.
Solr Proxy | http://sandbox.htrc.illinois.edu/solr | The HTRC Solr Proxy provides access to the Solr index. A sample query is: http://sandbox.htrc.illinois.edu/solr/ocr/select?q=shakespeare. Please refer to the Solr Guide for more details on queries.
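
As a rough illustration of how a query like the sample above is put together, the following sketch builds a select URL from a search term. The base URL and the `q` parameter come from the sample query; `rows` is a standard Solr query parameter, but whether the proxy honors it is an assumption, and the helper name `build_solr_query` is hypothetical. The sketch only constructs the URL and does not contact the service.

```python
from urllib.parse import urlencode

# Base endpoint taken from the sample query given for the Solr Proxy.
SOLR_SELECT_URL = "http://sandbox.htrc.illinois.edu/solr/ocr/select"


def build_solr_query(term, rows=None):
    """Return a select URL for `term` (hypothetical helper, not an HTRC API)."""
    params = {"q": term}
    if rows is not None:
        # `rows` is a standard Solr parameter that limits the number of
        # results; support for it by the proxy is assumed here.
        params["rows"] = rows
    return SOLR_SELECT_URL + "?" + urlencode(params)


print(build_solr_query("shakespeare"))
```

Building the query string with `urlencode` rather than by hand keeps multi-word or punctuated search terms correctly escaped.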

Recent addition: HTRC Bookworm: http://sandbox.htrc.illinois.edu/bookworm 

Q: What are the differences between the Production Stack and the Sandbox?

A: This table outlines the differences between the Production Stack and the Sandbox:

Feature | Production Stack | Sandbox
purpose | a distributed, service-oriented cyberinfrastructure to support the digital humanities research and text analysis of HTRC members | a community asset meant to be open to the community and for interested users to try things out on a smaller scale
number of machines | 9 | 1
corpus | full public domain set | non-Google scanned public domain subset
number of volumes | 2.7 million | 250,000
compute resource | a separate 128-node cluster | local on the Sandbox
accounts | personal account | pre-defined account pool
account reclamation | no | yes (reclaimed and reassigned after 30 days of inactivity)

Q: What is the HTRC Solr Proxy and how is it different from Apache Solr?

A: The HTRC Solr Proxy is a thin service in front of Apache Solr services for security and auditing purposes. The Solr Proxy filters requests so that only read-only requests are allowed, which protects our indices from being modified; other than that, it is fully compatible with Apache Solr. Please see the Solr Proxy API User Guide.

Using the search on the hathitrust.org site, you can find digitized items in the HathiTrust Digital Library (HTDL) and read them if they are in the public domain. From the HTDL, you can create collections that you are able to upload to HTRC Analytics as a workset. With the HTRC tools you can work with material from the HathiTrust Digital Library at scale, using computational methods to analyze collections of content, called worksets in HTRC, relevant to your research.

Q6: What types of data and metadata does HTRC provide?

A: The availability of data and metadata in HTRC depends on the tool or service.

  • HTRC algorithms and HTRC Data Capsules currently provide access to a snapshot of the public domain corpus OCR text from HathiTrust, as well as each volume’s MARC bibliographic and METS metadata. Both the HTRC algorithms and Capsule environments draw from the HTRC Data API described below.

  • The HTRC also makes available two datasets, the HTRC Extracted Features Dataset and a dataset of Word Frequencies in English Language Literature, 1700-1922. HTRC Extracted Features includes metadata and extracted page-level data (words and word counts) for 13.7 million volumes.

  • HathiTrust+Bookworm visualizes data for 13.7 million volumes.


Q7: What is the difference between the HTRC Data API and HathiTrust datasets and APIs?

A: This table outlines the differences between the HTRC Data API and the HathiTrust (HT) Data API. The HTRC Data API currently functions within the HTRC Data Capsule.

Feature | HTRC Data API | HT Data API
purpose | to serve high-performance large-scale algorithms and programs | to provide public users some volume retrieval capabilities
throttling enforcement | no | yes
security | OAuth2 / JWT | OAuth
bulk retrieval of volumes | yes | no
metadata available | METS | METS, MARC

Q: What is HTRC's non-consumptive research? What is the HTRC Data Capsule?

A: The HTRC Data Capsule provides a researcher with a virtual machine that the researcher configures as needed, including loading the necessary software packages and data sets. When ready to run an analysis, the researcher switches the capsule from maintenance mode to "secure mode", and the routines then run in a way that does not allow content from the HT repository to leak out. When the run has completed, the researcher receives an email giving the location from which to download the results. The HTRC Data Capsule is an alpha version and is undergoing internal testing.

Q: How do I use the HTRC Data API? 

A: Please see the HTRC Data API Users Guide.

Q: How do I create and analyze worksets in the portal?

Worksets are collections of volumes from our collection. There are currently two types of workset: basic and labeled. Basic worksets can be created with the Workset Builder or by uploading a CSV; labeled worksets can only be added by uploading a CSV.

Creating worksets with the Workset Builder

The easiest way to create a basic workset is to use the Workset Builder. The Workset Builder allows you to search across our collection. In the search results, note that there is a select button.

All the items that you select are kept in the Workset Builder. To review them, click "selected items" in the navigation bar. This is meant as a workspace for building a volume list for the workset. To save a workset of these items, click "Create/Update workset".

When you're saving a workset, note that it can be saved as public (viewable by all users) or private. After saving a workset, it will be available in the HTRC Portal, for use in analysis or for download.

Building labeled worksets

While a basic workset simply collects volumes in one place, it is possible to add classes to worksets. This allows for use with classification algorithms, such as Naive Bayes.

The CSV can be built in your preferred way. One common approach is to:

  1. Build a basic workset in the Workset Builder.
  2. Download the basic workset.
  3. Open the workset in the HTRC CSV Editor prototype (or a spreadsheet app of your choosing).
  4. In the CSV Editor or spreadsheet:
    1. Add a 'class' column and fill it in.
    2. Optionally, append other CSVs.
    3. Add volumes manually (by looking up the "Volume_id" in the Workset Builder).
  5. Upload the output of the HTRC CSV Editor, or the saved spreadsheet, to the HTRC Portal.
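
The spreadsheet steps above can also be scripted. The sketch below is one way to do it in Python, assuming the downloaded basic workset is a CSV with a header row and the volume id in its first column (the exact download format is an assumption). The function name `label_workset` and the label mapping are hypothetical.

```python
import csv
import io


def label_workset(basic_csv_text, labels, default=""):
    """Return labeled-workset CSV text: a header line, then volume id and class.

    `basic_csv_text` is assumed to have a header row with the volume id in
    the first column; `labels` maps volume ids to class names.
    """
    reader = csv.reader(io.StringIO(basic_csv_text))
    next(reader, None)  # skip the header of the basic workset
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["volume_id", "class"])
    for row in reader:
        if not row or not row[0].strip():
            continue  # ignore blank lines
        volume_id = row[0].strip()
        # Volumes without an entry in `labels` get the default class.
        writer.writerow([volume_id, labels.get(volume_id, default)])
    return out.getvalue()


basic = "volume_id\nmdp.39015001796500\nuc2.ark:/13960/t6m041m4z\n"
labels = {"mdp.39015001796500": "Austen", "uc2.ark:/13960/t6m041m4z": "Dickens"}
print(label_workset(basic, labels))
```

The returned text can be written to a file and uploaded in the same way as a spreadsheet export.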

A labeled workset CSV should use the following format:

  • the first line should be a header (or names of each column);
  • the first column should be a volume id, and the second column should record the label of the volume.

Below is an example of what the CSV file looks like. Given some volumes, classes are assigned to them based on some criteria. For example, here the labels are the names of the authors of the volumes:

volume_id,class
mdp.39015001796500,Austen
uc2.ark:/13960/t42r3rg51,Austen
uc2.ark:/13960/t3dz03x48,Austen
uiuo.ark:/13960/t4km00443,Austen
mdp.39015004997253,Austen
uc2.ark:/13960/t6c24sq2z,Austen
uc2.ark:/13960/t6m041m4z,Dickens
uiuo.ark:/13960/t5cc1pz8f,Dickens
uiuo.ark:/13960/t1wd47104,Dickens
uc2.ark:/13960/t2v40sj3m,Dickens
uiuo.ark:/13960/t3tt5bm1x,Dickens
uiuo.ark:/13960/t6n013296,Dickens
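
Before uploading, it can help to check a file against the format rules above. The following sketch is a hypothetical checker, not an HTRC tool. It assumes only what is stated here (a header line, then a volume id in the first column and a label in the second) and reports rows that break that shape.

```python
import csv
import io


def check_labeled_workset(csv_text):
    """Return a list of problems found; an empty list means the shape looks right."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ["file is empty: a header line is required"]
    problems = []
    # Line 1 is the header, so data rows start at line 2.
    for line_no, row in enumerate(rows[1:], start=2):
        cells = [cell.strip() for cell in row]
        if len(cells) < 2 or not cells[0] or not cells[1]:
            problems.append(f"line {line_no}: expected a volume id and a label")
    return problems


sample = "volume_id,class\nmdp.39015001796500,Austen\nuc2.ark:/13960/t6m041m4z,\n"
print(check_labeled_workset(sample))
```

The checker does not verify that the volume ids exist in the collection; it only looks at the layout of the file.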

Uploading Worksets

Worksets are uploaded in the HTRC Portal, under Worksets > Upload Workset, or with the '+' button in the workset list view. This is an alternative to the Workset Builder, and currently the only way to add labeled worksets.

Notes:

...

Q: Should I save my results in the portal?

A: If you want to ensure that results are retained through a restart of the services, then you should save your results.

Q: What is the login timeout?

A: The current login timeout is 12 hours.

Support

Q: Where do I go for more information?

A: Below are links to some very useful documentation:

Q: This is a release. Can I download the code?

A: Yes. All of the HTRC services code modules are open source and are available from SourceForge. Go to http://sourceforge.net/p/htrc/code/ to browse the code, or check out directly from SVN using:

svn co svn://svn.code.sf.net/p/htrc/code/

...

Q8: What happened to the HTRC Solr Proxy API?

A: As the HTRC moves to update and improve its search and workset-building services, the Solr Proxy API has been retired. For now, you can search for HathiTrust volumes via the HathiTrust Digital Library interface. Look for improved functionality in the near future, and please reach out with your workset-building scenarios that require additional search functionality. 

Q9: How do I ask questions or start discussions with other users?

A: Please join the HTRC User Group mailing list.

Q: How do I contribute code to HTRC?

A: HTRC has a GitHub set up for browsing contributed code. It is at https://github.com/htrc

...

  • All users are subscribed to a listserv called HTRC-Announce when they create an HTRC Analytics account. Only approved senders can send mail through this list.

Q10: How do I report issues or give feedback?

A: We welcome your feedback! You can send an email to HTRC Support at htrc[dash]help[at]hathitrust[dot]org. We track support requests using JIRA, and you can log in to see your requests and our responses here: https://jira.htrc.illinois.edu/browse/HTRC. You need to create a JIRA login account if you have not done so already. To provide feedback, you may also use the "feedback" tab found on the right-hand side of various portal pages to pop up a form.

Q11: Where do I go for more information?

A: If you have not found what you are looking for in our documentation, you might find the material posted to our Publications and Presentations page useful for further reading.

You might also consider attending a workshop. You can find information on future workshops on our calendar.

Or you can ask for further assistance on our mailing lists. See below for more information about signing up.