




This documentation has been updated for the newest format of URLs for the Extracted Features dataset, intended for release in August 2016. This format no longer has basic and advanced features described in separate files. If you are looking for information on the earlier format, see revision 32 of this page.


On this page, we introduce the Extracted Features functionality (currently in beta release) recently developed by the HathiTrust Research Center (HTRC). This functionality is one of the ways in which users of HTRC's tools can perform non-consumptive analysis of subsets of the HathiTrust Digital Library's corpus that they have custom-selected by means of the workset mechanism available through the HTRC.


You can now see the status of the job. The status will initially show as “Staging”. (Refresh the screen after some time and you will see that the status has changed to “Queued”.)


2.5 Open the completed job


2.7 Download the script returned by the EF_Rsync_Script_Generator algorithm

At this point, the script will be downloaded to your computer’s hard disk, and the message at the bottom left of your browser window will be replaced by just the name of the downloaded file.


If your workset contained N volumes with HathiTrust volume IDs V1, V2, V3, ... VN respectively, then executing the shell script as shown above will cause the following compressed feature data files for the corresponding volumes to be transferred to your computer’s hard disk via rsync:

V1.json.bz2, V2.json.bz2, V3.json.bz2, ... VN.json.bz2

For the workset in this example, because it contained only one volume, the book Buch der Lieder by Heinrich Heine with the HathiTrust volume ID mdp.39015012864743, the script will transfer one file to your machine. Following the naming pattern above, that file is mdp.39015012864743.json.bz2.
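The naming pattern above can be used to check that a transfer completed. The following is a minimal sketch, not part of the HTRC tooling: the function names `expected_file_name` and `missing_files` are hypothetical, and it assumes each volume ID appears verbatim in the file name, as in the example above. It searches the download directory recursively, so it makes no assumption about the directory layout that rsync creates.

```python
from pathlib import Path


def expected_file_name(volume_id):
    """Build the file name for a volume ID, following the V.json.bz2 pattern."""
    return f"{volume_id}.json.bz2"


def missing_files(volume_ids, download_dir):
    """Return the expected file names that are not found under download_dir."""
    root = Path(download_dir)
    # Collect the names of all compressed feature files, at any depth.
    found = {path.name for path in root.rglob("*.json.bz2")}
    return [
        expected_file_name(volume_id)
        for volume_id in volume_ids
        if expected_file_name(volume_id) not in found
    ]
```

For the one-volume workset in this example, `missing_files(["mdp.39015012864743"], ".")` would return an empty list once the file has been transferred into the current directory.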


2.9 Uncompress the downloaded files

Because the feature data files are compressed, you may need to uncompress them into JSON-formatted text files, depending on your need. If you are using the files with the HTRC Feature Reader, the library will deal with the compression.

You will now be able to view the files in the text editor of your choice, and perform text analysis with them using your own code, in the programming language(s) of your choice.
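If you prefer to handle the compression in your own code, the sketch below shows one possible approach using only the Python standard library. The function names `load_features` and `decompress` are hypothetical, and the sketch makes no assumption about the structure of the JSON beyond it being valid JSON encoded as UTF-8.

```python
import bz2
import json
from pathlib import Path


def load_features(path):
    """Parse a .json.bz2 file directly, without writing an uncompressed copy."""
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        return json.load(handle)


def decompress(path):
    """Write an uncompressed .json file next to the .bz2 file; return its path."""
    source = Path(path)
    target = source.with_suffix("")  # drops the trailing ".bz2"
    with bz2.open(source, mode="rb") as src, open(target, "wb") as dst:
        # Copy in 1 MiB chunks so large volumes are not held in memory at once.
        for chunk in iter(lambda: src.read(1024 * 1024), b""):
            dst.write(chunk)
    return target
```

Use `load_features` when you only need the data in memory, and `decompress` when you want a plain JSON file to open in a text editor.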