
*HTRC Data Capsule is funded in part by a grant from the Alfred P. Sloan Foundation. Here we introduce HTRC Capsule v1.0 and its use for non-consumptive analysis of the HathiTrust repository. Here we introduce the Extracted Features functionality (currently in beta release) that has recently been developed by the HathiTrust Research Center. This functionality is one of the ways in which users of the HathiTrust Research Center can perform non-consumptive analysis of subsets of the HathiTrust Digital Library's corpus that they have custom-selected through the HTRC workset mechanism. (Currently, this functionality is available only for the HathiTrust Digital Library's public domain corpus, which consists of slightly fewer than 5 million volumes.)


1. Overview

This document is a starting point for users interested in downloading the JSON-format Extracted Features (EF) data files corresponding to the specific HathiTrust Digital Library volumes that make up a custom workset built with the HTRC Workset Builder. We will show, step by step, how to create a custom workset and then how to download the data files corresponding to its contents.
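
Once an EF data file has been downloaded, a quick first check is to confirm that it parses as JSON and to see what it contains at the top level. The following is a minimal sketch only: the function name `summarize_ef_file` and the file name in the usage note are hypothetical, and the code deliberately assumes nothing about the EF schema beyond the file being valid, uncompressed JSON.

```python
import json
from pathlib import Path


def summarize_ef_file(path):
    """Load one JSON data file and describe its top-level structure.

    Hypothetical helper for illustration; it makes no assumption about
    the fields an EF file contains.
    """
    with Path(path).open(encoding="utf-8") as handle:
        data = json.load(handle)

    # Report only the container type and, for an object, its top-level keys.
    if isinstance(data, dict):
        return {"type": "object", "keys": sorted(data)}
    if isinstance(data, list):
        return {"type": "array", "length": len(data)}
    return {"type": type(data).__name__}
```

For example, `summarize_ef_file("my_volume.json")` (a placeholder file name) would return the sorted top-level keys if the file holds a JSON object. If the downloaded files turn out to be compressed, they would need to be decompressed first.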

2. Create a custom workset

This section shows you how to create a custom workset, for whose volume(s) you will eventually download the corresponding basic and advanced EF data files. Your workset can contain as many volumes as you wish. For the sake of simplicity, however, the example workset in this section consists of a single volume from the HathiTrust Digital Library's public domain collection: a 1920 edition of the book of poems titled Buch der Lieder by the German poet Heinrich Heine. We then show you how to download the EF data files corresponding to this workset. (One of the use cases we have posted for the EF approach to non-consumptive text analysis also uses this particular book by Heine to make its point.)

This section covers all the operations you can perform on the virtual machine. You must log in to the HTRC portal before you can perform these operations.

2.1 Sign in to the HTRC 

Navigate to the “Create Virtual Machine” tab and fill in the form. Choose an image from the drop-down list, provide a username and password for the VNC session, and choose how many CPUs and how much memory you want your virtual machine to have. Finally, click the "Create VM" button. Creating the VM usually takes about 1 minute.