Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.


Excerpt

See the different ways you can download EF files to your local machine or inside a data capsule.


Table of Contents
maxLevel3

...

A sample of 100 extracted feature files is available for download through your browser: sample-EF202003.zip

Also, thematic collections are available to download: DocSouth_sample_EF202003.zip (82 volumes), EEBO_sample_EF202003.zip (234 volumes), ECCO_sample_EF202003.zip (412 volumes).

Filepaths

The dataset is stored in a directory specification created by HTRC called stubbytree. (Note: This is a change from previous versions of the Extracted Features dataset that used the pairtree directory specification.) Stubbytree places files in a directory structure based on the file name, with the highest level directory being the HathiTrust source code (i.e. volume ID prefix), and then using every third character of the cleaned volume ID, starting with the first, to create a sub-directory. For example the Extracted Features file for the volume with HathiTrust ID nyp.33433070251792 would be located at: 

...

The rsync module (or alias path) for Extracted Features 2.0 is data.analytics.hathitrust.org::features-2020.03/ . 

If you run the rsync command as written above, without specifying file paths, it will sync all files. Do not do this unless you are prepared to work with the full dataset, which is 4 TB. Make sure to include the final period (.) when running your command in order to sync the files to your current directory, or else provide the path to the local directory of your choosing where you would like the files to be synced to.

...

Code Block
rsync -azv data.analytics.hathitrust.org::features-2020.03/listing/htrc-ef-all-filesfile_listing.txt 


It is possible to sync any single Extracted Features file in the following manner:

...

A sample of 100 extracted feature files is available for download through your browser: sample-EF201801.zip.

Also, thematic collections are available to download: DocSouth_sample_EF201801.zip (87 volumes),EEBO_sample_EF201801.zip (355 volumes), ECCO ECCO_sample_EF201801.zip (505 volumes).

Filepaths

The data is stored in a pairtree directory structure, allowing you to infer the location of any file based on its HathiTrust volume identifier. Pairtree format is an efficient directory structure, which is important for HathiTrust-scale data, where the files are placed in directories based on “pairs” of characters in their file names. For example the Extracted Features file for the volume with HathiTrust ID mdp.39015073767769 would be located at: 

...

The Rsync module (or alias path) for Extracted Features 1.5 is data.analytics.hathitrust.org::features-2018.01/ . 

If you run the rsync command as written above, without specifying file paths, it will sync all files. Do not do this unless you are prepared to work with the full dataset, which is 4 TB. Make sure to include the final period (.) when running your command in order to sync the files to your current directory, or else provide the path to the local directory of your choosing where you would like the files to be synced to.

...

Code Block
rsync -azv data.analytics.hathitrust.org::features-2018.01/listing/htrc-ef-all-filesfile_listing.txt 


In order to rsync a file or set of files, you must know their directory path on HTRC’s servers. It is possible to sync any single Extracted Features file in the following manner:

...

Extracted Features 1.5 files can also be downloaded for search results in HTRC's beta Workset Builder 2.0. After completing your search, you can download either the Extracted Features files for your results set or for single files from your results.

...

A sample of 100 extracted feature files is available for download through your browser:sample-betaEF201801.zip.

Also, thematic collections are available to download: DocSouth_sample_EF201801.zip (87 volumes), EEBO_sample_EF201801.zip (355 volumes), ECCO_sample_EF201801.zip (505 volumes).

Filepaths

The data is stored in a pairtree directory structure, allowing you to infer the location of any file based on its HathiTrust volume identifier. Pairtree format is an efficient directory structure, which is important for HathiTrust-scale data, where the files are placed in directories based on “pairs” of characters in their file names. For example the Extracted Features file for the volume with HathiTrust ID mdp.39015073767769 would be located at: 

...

The Rsync module (or alias path) for Extracted Features .2 is data.analytics.hathitrust.org::features-2015.02

If you run the rsync command as written above, without specifying file paths, it will sync all files. Do not do this unless you are prepared to work with the full dataset, which is 1.2 TB. Make sure to include the final period (.) when running your command in order to sync the files to your current directory, or else provide the path to the local directory of your choosing where you would like the files to be synced to.

...

Code Block
rsync -azv data.analytics.hathitrust.org::features-2015.02/listing/htrc-ef-all-filesfile_listing.txt 


Users hoping for a more flexible file listing can use rsync's --list-only flag.

...


To download the Extracted Features data for a specific workset in HTRC Analytics, there is an algorithm that generates the Rsync download script, the Extracted Features Download Helper. The tool can also be useful if you don’t want to go through the process of determining files paths. Select version 0.2 when you run the algorithm to get the Extracted Features 0.2 version of the files.

The algorithm creates a shell script that you can download and run from your local command line. The file lists the rsync commands for every volume in an HTRC workset. Once you have run the algorithm and downloaded the resulting file, you will run the resulting .sh file.

...