When something goes wrong, this page is here to help you identify and hopefully remedy the problem. Please contact HTRC support at firstname.lastname@example.org with any questions or bugs or peculiarities you would like to report.
HTRC Tools and Services
Q: What is the HTRC?
A: The HTRC is the research arm of the HathiTrust. It facilitates scholarly research using the large-scale HathiTrust Digital Library by providing mechanisms for researchers to access content in the HathiTrust and study it using computational tools for text analysis.
The HTRC is a partnership between Indiana University (IU) Libraries, the Pervasive Technology Institute, and the School of Informatics and Computing at IU, as well as the University of Illinois at Urbana-Champaign (UIUC) Libraries and the Graduate School of Library and Information Science at UIUC.
Q: What are the HTRC tools and services?
A: HTRC has created a suite of tools that allow researchers to perform text analysis on content in the HathiTrust Digital Library. Most of these tools are available via the HTRC Analytics website. They are intended to meet the various needs of HTRC researchers.
- HTRC Algorithms: a set of tools for assembling collections of digitized text and performing text analysis on them.
- HTRC Extracted Features: an openly-available dataset of metadata and derived data from the HathiTrust corpus.
- HTRC Data Capsule: a secure computing environment for performing researcher-driven text analysis on HathiTrust content.
- HathiTrust+Bookworm: a tool for visualizing and analyzing word usage trends in the HathiTrust Digital Library.
Q: Who can use HTRC?
A: Most of HTRC's services require an account on HTRC Analytics to use. Scholars from non-profit institutions of higher education or other research institutions are eligible for an account, and users don't need to be affiliated with a HathiTrust member institution in order to qualify. Some services within HTRC Analytics are further restricted: Access to an HTRC Data Capsule with computational access to items in copyright is available ONLY to member-affiliated researchers who complete a Capsule request form. Others require no account to use, such as the HTRC Extracted Features or HathiTrust+Bookworm.
Q: What is the difference between using the HTRC and searching the HathiTrust Digital Library?
A: Using the search on the hathitrust.org site, you can find digitized items in the HathiTrust Digital Library (HTDL) and read them if they are in the public domain. From the HTDL, you can create collections that you are able to upload to HTRC Analytics as worksets. With the HTRC tools, you can work with material from the HathiTrust Digital Library at scale, using computational methods to analyze collections of content relevant to your research (called worksets in HTRC).
Q: What types of data and metadata does HTRC provide?
A: The availability of data and metadata in HTRC depends on the tool or service.
- HTRC algorithms and HTRC Data Capsules currently provide access to a snapshot of the public domain corpus OCR text from HathiTrust, as well as each volume’s MARC bibliographic and METS metadata. Both the HTRC algorithms and Capsule environments draw from the HTRC Data API described below.
- The HTRC also makes available two datasets: the HTRC Extracted Features Dataset and a dataset of Word Frequencies in English Language Literature, 1700-1922. HTRC Extracted Features includes metadata and extracted page-level data (words and word counts) for 13.7 million volumes.
- HathiTrust+Bookworm visualizes data for 13.7 million volumes.
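As a rough illustration of what "page-level data (words and word counts)" makes possible, the minimal sketch below sums per-page token counts into volume-level counts. The nested dictionary is a simplified, hypothetical stand-in for one volume's record: the field names (`pages`, `tokens`) and the volume ID are assumptions made for this example, and the actual Extracted Features file schema is described in the dataset's own documentation.

```python
from collections import Counter

# Hypothetical, simplified stand-in for one volume's page-level features.
# Real Extracted Features files use a richer schema; consult the dataset docs.
sample_volume = {
    "id": "example.volume001",  # placeholder volume ID, not a real one
    "pages": [
        {"seq": 1, "tokens": {"whale": 3, "sea": 2}},
        {"seq": 2, "tokens": {"whale": 1, "ship": 4}},
    ],
}


def volume_token_counts(volume):
    """Sum per-page token counts into one Counter for the whole volume."""
    totals = Counter()
    for page in volume["pages"]:
        # Counter.update adds the counts from a mapping to the running totals.
        totals.update(page["tokens"])
    return totals


counts = volume_token_counts(sample_volume)
print(counts.most_common())
```

Because only counts are kept, this kind of aggregation can be done on a local desktop without access to the full text.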
Q: What is the login timeout for HTRC Analytics?
A: The current login timeout is 1 hour. However, your submitted jobs won't be affected by this timeout. They will still run even if you log out or the system logs you out.
Q: What are worksets and what do I do with them?
A: Worksets are sub-collections of HathiTrust volumes created by researchers. You can run HTRC algorithms against worksets in order to analyze them or download their Extracted Features. Worksets can be cited, and researchers can choose to make their worksets public or private. Learn more about worksets.
Q: How do I create a workset?
Q: Can I analyze non-HathiTrust data alongside HathiTrust data?
A: Within the HTRC Analytics platform, only in the HTRC Data Capsule environment. HTRC Algorithms function only on "worksets," which are user-created collections of content from the HathiTrust Digital Library. You can, however, import outside data into your Capsule when it is in maintenance mode and work with it within that system. Alternatively, if you prefer to work only on your local desktop, you can use HTRC Extracted Features alongside your own data.
Q: What is the HTRC Data Capsule environment and what can it be used for?
A: The HTRC Data Capsule environment provides a secure computing environment to access content in the HathiTrust Digital Library. Users are provisioned virtual machines called capsules to which they can import and then analyze HathiTrust volumes. Users can only perform computational analysis within the secure Data Capsule environment and then export the results of their analysis. Users cannot export volume content outside the HTRC Data Capsule.
Q: Do I have computational access to the HathiTrust Digital Library's copyrighted content in Data Capsule?
A: Computational access to items in copyright is available ONLY to HathiTrust member-affiliated researchers. Existing Data Capsule users from member institutions or new Data Capsule requesters from member institutions have the exclusive option to select “Full Corpus Access,” which includes copyrighted items.
Q: HTRC Analytics showed an error message when I tried to create a Data Capsule. What went wrong?
A: Most likely you have reached the maximum amount of space allowed per user in the capsules system. Please delete one of your capsules, or contact HTRC support to solve the issue: email@example.com
Q: I have some Python scripts that I want to use in my analysis within the HTRC Data Capsule. How should I start?
- First, store your Python scripts somewhere on the Internet.
- Start your capsule from within the Analytics interface, and make sure your machine is in maintenance mode.
- Enter your capsule via Terminal viewer or Remote Desktop viewer.
- Download the Python scripts from the Internet onto your capsule.
- Switch to secure mode.
- If you know the IDs of the volumes you are interested in, you can fetch the content of those volumes by using the sample Python script in Fetching Volume OCR Content in HTRC Data Capsule (Secure Mode).
- Run your Python scripts against the content.
- If you don't have the volume IDs of your interest, you can search for volumes in the HathiTrust Digital Library. You can search by subject, topic, author, year, etc., and identify the volumes of interest and save your chosen volumes as a collection in HathiTrust. From there, you can either use the HTRC Workset Toolkit to load volumes from the collection in your Capsule, or download the collection's metadata to retrieve the volume IDs for the volumes you have selected.
- Once you have the volume IDs ready, you can go ahead to fetch the volume content in Data Capsule secure mode and perform analysis using your Python scripts as mentioned above.
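The last steps above (run your own scripts against fetched content) might look something like the following minimal sketch. It is not the sample script referenced above and does not fetch anything itself. It assumes, hypothetically, that volume text has already been fetched into a directory holding one `VOLUMEID.txt` file per volume, and that the volume IDs are listed one per line in a text file. The file layout, the function names, and the `demo.vol1`-style IDs are all invented for illustration. Real volume IDs can contain characters such as `:` and `/`, so an actual on-disk layout will differ.

```python
import re
import tempfile
from collections import Counter
from pathlib import Path


def read_volume_ids(id_file):
    """Return the non-empty, stripped lines of a volume ID list file."""
    lines = Path(id_file).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def word_counts(text):
    """Count lowercase alphabetic word tokens in a string of OCR text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def analyze_volumes(volume_ids, text_dir):
    """Map each volume ID to word counts for its text file, if present.

    Assumes (hypothetically) one 'VOLUMEID.txt' file per volume in text_dir.
    Volumes without a matching file are skipped.
    """
    results = {}
    for vid in volume_ids:
        path = Path(text_dir) / f"{vid}.txt"
        if path.exists():
            results[vid] = word_counts(path.read_text(encoding="utf-8"))
    return results


# Demo with made-up data so the sketch runs anywhere, not only in a capsule.
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    (tmp_path / "ids.txt").write_text("demo.vol1\n\ndemo.vol2\n", encoding="utf-8")
    (tmp_path / "demo.vol1.txt").write_text("The sea, the sea!", encoding="utf-8")
    ids = read_volume_ids(tmp_path / "ids.txt")
    results = analyze_volumes(ids, tmp_path)
    print(ids, results)
```

The output of a script like this is derived data (word counts) rather than volume text, which is the kind of result the Data Capsule export policy is designed around.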
Q: Can I import the workset that I have used in HTRC Analytics into the HTRC Data Capsule?
A: Currently, there are two ways to do this, depending on whether you have first created a collection in HathiTrust:
- Download the workset from HTRC Analytics in order to export a list of the volume IDs for that workset, and then use the HTRC Workset Toolkit in the Data Capsule to access the content in those volumes. It is not presently possible to export a workset from HTRC Analytics directly into the HTRC Data Capsule, but we expect to integrate this functionality into future versions.
- Load volumes from a HathiTrust Digital Library collection into a Capsule with the HTRC Workset Toolkit, using the collection's URL. Directions are available here: https://htrc.github.io/HTRC-WorksetToolkit/cli.html.
Keep in mind which volumes will be available to you within your Capsule, depending on the kind of Capsule you are using and whether it has access to the full corpus or only "full view"/public domain volumes.
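For the first option, the downloaded workset has to be reduced to a plain list of volume IDs. The sketch below shows one way this could be done, assuming the download is a delimited text file with a header row. The column name `volume_id`, the comma delimiter, and the sample rows are hypothetical; check the header of the file you actually downloaded and adjust accordingly.

```python
import csv
import io


def extract_volume_ids(rows_file, id_column="volume_id", delimiter=","):
    """Return unique volume IDs, in first-seen order, from a delimited file.

    'volume_id' is a hypothetical column name; pass the real column name
    from the header of the file you downloaded.
    """
    reader = csv.DictReader(rows_file, delimiter=delimiter)
    seen = {}
    for row in reader:
        vid = (row.get(id_column) or "").strip()
        if vid:
            seen.setdefault(vid, None)  # dicts preserve insertion order
    return list(seen)


# Made-up export contents standing in for a downloaded workset file.
sample_export = io.StringIO(
    "volume_id,title\n"
    "demo.vol1,First Example\n"
    "demo.vol2,Second Example\n"
    "demo.vol1,First Example\n"
)
volume_ids = extract_volume_ids(sample_export)
print("\n".join(volume_ids))
```

With a real file, you would pass an open file handle (for example, `open(path, newline="", encoding="utf-8")`) instead of the in-memory sample.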
Q: Can you tell me exactly how much data I am allowed to export from my capsule?
A: The standard for non-consumptive export depends on the scope and scale of the data analyzed. The general rule-of-thumb is whether the export would create a substitute for human-reading the original text. (The full Non-Consumptive Use Research Policy is also available for your reference.) If you would like someone to pre-review a sample file that would represent the kinds of data you would like to export from a capsule before you begin your work, please contact firstname.lastname@example.org.
Q: How do I use the HTRC Data API?
A: Check out our user's guide for more information about using the HTRC Data API in the HTRC Data Capsule.
Q: What is the difference between the HTRC Data API and HathiTrust Data API?
A: This table outlines the differences between the HTRC Data API and the HathiTrust Data API:
| | HTRC Data API | HathiTrust Data API |
|---|---|---|
| Purpose | to serve high-performance, large-scale algorithms and programs | to provide public users some volume retrieval capabilities |
| Bulk retrieval of volumes | yes | no |
| Metadata available | METS | METS, MARC |
Q: How do I cite HTRC services, tools or data?
A: If you're working with an HTRC dataset, such as Extracted Features, please use the citation guidelines on the documentation pages for those datasets. Whenever possible, we mint DOIs for our datasets and provide information about how to cite them.
The sample citation for the EF Dataset, 2.0 version is:
Jacob Jett, Boris Capitanu, Deren Kudeki, Timothy Cole, Yuerong Hu, Peter Organisciak, Ted Underwood, Eleanor Dickson Koehl, Ryan Dubnicek, J. Stephen Downie (2020).
The HathiTrust Research Center Extracted Features Dataset (2.0). HathiTrust Research Center. https://doi.org/10.13012/R2TE-C227
For HTRC Analytics algorithms or other HTRC tools like Bookworm, here is an example citation (in Chicago style, 17th ed.):
“HTRC Analytics.” Named Entity Recognizer (v2.0). Accessed February 16, 2022. https://analytics.hathitrust.org/algorithms.
For HTRC Data Capsules:
HTRC Data Capsules. Accessed February 16, 2022. https://analytics.hathitrust.org/capsules.
What happened to...?
Q: What happened to the Workset Builder?
A: As HTRC upgrades its services and builds a new Workset Builder, the retired Workset Builder has been taken offline. The new system of creating a collection in the HathiTrust Digital Library better aligns workset-building with the HathiTrust and offers improved search and selection.
Q: What happened to the HTRC Solr Proxy API?
A: As the HTRC moves to update and improve its search and workset-building services, the Solr Proxy API has been retired. For now, you can search for HathiTrust volumes via the HathiTrust Digital Library interface. Look for improved functionality in the near future, and please reach out with your workset-building scenarios that require additional search functionality.
Q: What happened to the HTRC Sandbox?
A: The HTRC Sandbox, which was a space for testing and experimentation in the early days of the project, has been rolled into our production services available here:
- HTRC Analytics: a set of tools for assembling collections of digitized text and performing text analysis on them.
- HTRC Data Capsule: for use of the production-level HTRC Data API
User Accounts and Sign-in
Q: Why isn’t my institution listed on HTRC’s sign-in dropdown menu?
A: As of 2022, HTRC Analytics has updated its sign-in process so that users who have email addresses associated with any HathiTrust member institution or the identity management platform CILogon have the opportunity to log in with their institutional username and password, rather than using separate HTRC credentials.
Current and prospective HTRC Analytics users who are not associated with the two organizations listed above will need to continue logging in with separate HTRC credentials (please see Q: How do I create or access an HTRC account if my institution is not listed in the sign-in dropdown? for full details on what to do if your institution is not listed in the sign-in dropdown menu).
Q: How do I log into HTRC Analytics with my institutional credentials?
A: Click the blue Sign-In button in the top-right corner of the HTRC Analytics home screen:
Begin typing in your institution’s name in the text search box, and select your institution when it appears in the dropdown menu:
Click the blue Continue button when you have found and selected your institution.
Upon initial sign-in (and any time you have been fully signed out of your institution’s platform), you will be redirected to your institution’s login page. Enter your institution’s login credentials and complete any two-factor authentication your institution uses. If you are new to HTRC and did not have an account in our old login system, you will be redirected to an account registration page; fill in the provided form to create your account. If you had an account in the previous sign-in system, you will be prompted to either migrate your old account details or start fresh with a new account. You will be asked to migrate old account details only once (please see "I have a pre-existing account within the old sign-in system. How do I get access to my data from that account in the new system?" below for complete details). Once you have completed these steps, you will be fully signed in to your HTRC account and have access to your profile, data capsules, and worksets.
Q: I have a pre-existing account within the old sign-in system. How do I get access to my data from that account in the new system?
A: Because HTRC Analytics has updated our sign-in process (as of 2022), we have implemented an account migration workflow to migrate your existing account's data to the new system. This is an optional process. By choosing the option to migrate your previous data you'll be able to link all your previous account details and resources (e.g., worksets, data capsules, algorithm jobs, etc.) to the new account.
When you initially sign in using the new HTRC system (i.e., with your institutional or newly created HTRC-hosted (“local”) account credentials), you will be taken to an account migration page. On that page, enter the email associated with your new account to see if it is connected with a previous account. If there is a match, you will have the option to migrate your old account details to the new account.
If we cannot find an existing account with the same email, that means you have used a different email for your past account, or you don’t have an existing account. If you used a different email for your past account, enter it into the text search box and proceed with your migration preferences; if you wish to migrate, you will receive an email with an account migration link. If you are unable to match an email, you will select “I don’t have an account” at the bottom of the form to continue with completing your new account setup.
Q: How do I create an HTRC account and/or log in if my institution is not listed in the sign-in dropdown?
A: If your institution is not listed in the dropdown list provided, you will need to create an HTRC-hosted (a.k.a. “local”) account. To do so, click on the “Create an account with HTRC” link at the bottom of the sign-in pop-up:
Fill in the form provided on the following page. If your institutional email is not recognized by our system’s approved account list, you will need to submit a request for institutional account approval before you can create an account.
Once you have successfully created an HTRC local account, you will need to proceed with signing in. To do so, click the “Login with HTRC” link at the bottom of the sign-in pop-up.
Log into your account by entering your newly created HTRC Analytics username and password.
If you had a preexisting account with HTRC, you will have the option to migrate your old account’s data to the new one (please see "I have a pre-existing account within the old sign-in system. How do I get access to my data from that account in the new system?" for full details on how to migrate data).
HTRC Code and Infrastructure
Q: Can I see the code used to make HTRC tools and services operate?
A: Yes. All of the HTRC services code modules are open source and are available from GitHub: https://github.com/htrc.
Q: Where can I learn more about the HTRC Data Capsule development project?
A: More information can be found in the public version of the project's final report: http://hdl.handle.net/2022/19277
Q: To whom can I direct technical questions?
A: Please email HTRC support: email@example.com.
Get in touch!
Q: Where do I go for more information?
A: If you have not found what you are looking for in our documentation, you might find the material posted to our Publications and Presentations page useful for further reading.
You might also consider attending a workshop. You can find information on future workshops on our calendar.
Or you can ask for further assistance on our mailing lists. See below for more information about signing up.
Q: How do I report issues or give feedback?
A: We welcome your feedback! You can send an email to HTRC Support at firstname.lastname@example.org. We track support requests using JIRA, and you can log in to see your requests and our responses here: https://jira.htrc.illinois.edu/servicedesk/customer.
Q: How do I ask questions or start discussions with other users?
A: Please join the HTRC User Group mailing list.
- Subscribe here: https://list.indiana.edu/sympa/info/htrc-usergroup-l
- For questions that you want to discuss with us privately, please write to email@example.com, a list to which only HTRC internal staff are subscribed.
- All users are subscribed to a listserv called HTRC-Announce when they create an HTRC Analytics account. Only approved senders can send mail through this list.