


HTRC Data Capsules are secure computing environments developed to facilitate non-consumptive text analysis research. Each Capsule is a virtual machine (VM) that provides researchers a desktop they can use to perform their investigation of volumes in the HathiTrust Digital Library.

Use a Capsule: https://analytics.hathitrust.org/staticcapsules




HTRC Data Capsule Configurations

Capsule Technical Specifications

Configuration options:

  • Data Capsule Image: there are two images (versions) of the standard Capsule desktop, one that comes pre-loaded with sample volumes from the HathiTrust and one that does not
  • Virtual Machine CPUs (VCPUs): the number of virtual processors for the Capsule, from 2 to 4
  • Memory: between 4 GB and 16 GB
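
As a rough illustration of the ranges above, the following sketch checks a requested configuration before it is submitted. The `CapsuleConfig` record and `config_problems` function are hypothetical names used only for this example; they are not part of any HTRC tool, and the actual creation form may validate differently.

```python
from dataclasses import dataclass
from typing import List

# Ranges taken from the configuration options listed above (both inclusive).
VCPU_RANGE = (2, 4)
MEMORY_GB_RANGE = (4, 16)


@dataclass
class CapsuleConfig:
    """Hypothetical record of the choices made when creating a Capsule."""
    with_sample_volumes: bool  # which of the two desktop images to use
    vcpus: int
    memory_gb: int


def config_problems(cfg: CapsuleConfig) -> List[str]:
    """Return a list of problems; an empty list means the request is in range."""
    problems = []
    if not VCPU_RANGE[0] <= cfg.vcpus <= VCPU_RANGE[1]:
        problems.append(
            f"vcpus must be between {VCPU_RANGE[0]} and {VCPU_RANGE[1]}"
        )
    if not MEMORY_GB_RANGE[0] <= cfg.memory_gb <= MEMORY_GB_RANGE[1]:
        problems.append(
            f"memory_gb must be between {MEMORY_GB_RANGE[0]} and {MEMORY_GB_RANGE[1]}"
        )
    return problems


if __name__ == "__main__":
    # A request with too many VCPUs and too much memory reports two problems.
    for problem in config_problems(CapsuleConfig(True, vcpus=8, memory_gb=32)):
        print(problem)
```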

User Quotas


There is an overall disk quota, a memory quota, and a CPU quota for each user in the Data Capsule environment. One user can consume up to 100 GB of disk space, ~20 GB of memory, and 10 CPUs. If you attempt to create a second or third Capsule that exceeds your quota in one of the areas above, then you will encounter an error. 
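
The quota arithmetic can be sketched as follows. The quota figures come from the paragraph above (the memory quota is stated only approximately, so 20 GB is used here as an assumption). The per-Capsule disk allocation is not given in this guide, so the disk values in the example are placeholders, and `Usage` and `exceeded_quotas` are hypothetical names.

```python
from typing import Iterable, List, NamedTuple


class Usage(NamedTuple):
    disk_gb: float
    memory_gb: float
    vcpus: int


# Per-user quotas from the guide; memory is given there as roughly 20 GB.
QUOTA = Usage(disk_gb=100, memory_gb=20, vcpus=10)


def exceeded_quotas(existing: Iterable[Usage], requested: Usage) -> List[str]:
    """Name each resource whose total would exceed the quota if the
    requested Capsule were added to the existing ones."""
    capsules = list(existing) + [requested]
    exceeded = []
    for field in Usage._fields:
        total = sum(getattr(capsule, field) for capsule in capsules)
        if total > getattr(QUOTA, field):
            exceeded.append(field)
    return exceeded


if __name__ == "__main__":
    # One Capsule with 16 GB of memory already exists; a second one with
    # 8 GB would bring the total to 24 GB, over the memory quota.
    current = [Usage(disk_gb=10, memory_gb=16, vcpus=4)]
    print(exceeded_quotas(current, Usage(disk_gb=10, memory_gb=8, vcpus=4)))
```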


Pre-installed Software, Libraries, and Data

Each Capsule comes pre-loaded with the following software, libraries, and data. For more details about installed packages, consult the ReadMe file on the desktop of your Capsule.

Software


| Name | Version | URL | Note |
|---|---|---|---|
| Akka | 2.4.14 | http://akka.io/ | |
| Anaconda 3 | 4.2.0 | https://www.continuum.io/anaconda-overview | Supports both Python 2.X and 3.X. See list below for the Python libraries pre-installed (some via Anaconda) |
| Ant | 1.9.7 | http://ant.apache.org/ | |
| Hadoop | 2.7.3 | http://hadoop.apache.org/ | |
| InPho Topic Explorer | | http://inphodata.cogs.indiana.edu/ | |
| Mallet | 2.0.8 | http://mallet.cs.umass.edu/ | |
| R | 3.3.9 | https://www.r-project.org/ | |
| Sbt | 0.13.13 | http://www.scala-sbt.org/ | |
| Scala | 2.12.1 | https://www.scala-lang.org/ | |
| Spark | 2.0.2 | http://spark.apache.org/ | |
| Voyant Tools | | https://voyant-tools.org/ | |

Python Libraries 


HTRC-developed

  • htrc-feature-reader (learn more: https://github.com/htrc/htrc-feature-reader)
  • htrc workset toolkit (learn more: see the HTRC Workset Toolkit page)

General

  • csvkit
  • dask
  • GenSim (currently running with warning)
  • nltk
  • numpy
  • pandas
  • pytables
  • regex
  • scipy
  • theano
  • toolz
  • ujson
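
To confirm which of the general libraries are importable in a given Capsule, a short check like the one below may help. It is a sketch that assumes a Python 3 interpreter and uses only the standard library. The import names are assumptions where they differ from the package names in the list (for example, PyTables is imported as `tables`).

```python
import importlib.util

# Import names for the libraries listed above. The import name can differ
# from the package name: PyTables is imported as "tables".
IMPORT_NAMES = [
    "csvkit", "dask", "gensim", "nltk", "numpy", "pandas",
    "tables", "regex", "scipy", "theano", "toolz", "ujson",
]


def availability(names):
    """Map each top-level module name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, present in availability(IMPORT_NAMES).items():
        print(f"{name}: {'available' if present else 'missing'}")
```

Because `find_spec` only locates a top-level module without importing it, the check does not trigger import-time warnings such as the one noted for GenSim.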

Sample data

  • 3 sample HTRC worksets of 1000 volumes each: U.S. Government Documents, German language volumes, 19th Century English Literature

Using an HTRC Data Capsule

Follow a Tutorial: see the HTRC Data Capsule Step-by-Step Guides

Administering a Capsule

Use the HTRC site to handle administrative tasks for your Capsule:

  • Create - a Capsule is created, but it is not yet running
  • Start - turn the Capsule on in maintenance mode
  • Stop - shut down a Capsule
  • Delete - the Capsule is deleted (including its data and settings)
  • Switch modes - change the Capsule from maintenance to secure mode, or vice versa (see below)
  • See status - view your Capsules and their statuses
  • Interact - use your Capsule either through a desktop view or a terminal (command line) view
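
One way to picture the administrative actions above is as a small state machine. The sketch below is illustrative only: the state names are invented for this example, and the assumptions that a Capsule can be stopped from either mode and deleted from any state are not confirmed by this guide.

```python
# Each action maps an allowed starting state to the resulting state.
# "created" is the state right after Create (not yet running).
TRANSITIONS = {
    # Start always brings the Capsule up in maintenance mode.
    "start": {"created": "maintenance", "stopped": "maintenance"},
    # Switching modes toggles between maintenance and secure.
    "switch_mode": {"maintenance": "secure", "secure": "maintenance"},
    # Assumption: a running Capsule can be stopped from either mode.
    "stop": {"maintenance": "stopped", "secure": "stopped"},
    # Assumption: a Capsule can be deleted from any state.
    "delete": {
        "created": "deleted",
        "stopped": "deleted",
        "maintenance": "deleted",
        "secure": "deleted",
    },
}


def next_state(state: str, action: str) -> str:
    """Return the state after applying an action, or raise ValueError."""
    try:
        return TRANSITIONS[action][state]
    except KeyError:
        raise ValueError(f"cannot {action!r} a Capsule that is {state!r}") from None


if __name__ == "__main__":
    state = "created"
    for action in ["start", "switch_mode", "switch_mode", "stop"]:
        state = next_state(state, action)
        print(action, "->", state)
```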

Maintenance vs. Secure Mode

The Capsules are configured with special security settings that allow you to interact with them in two modes: maintenance mode and secure mode.

  • In maintenance mode, you are allowed to access the network freely and install whatever software you want.
  • In secure mode, general network access is restricted, but you can access the HTRC corpus repository, which is otherwise blocked. Any changes you make to the Capsule in secure mode will not persist. To save data from your analysis, you'll need to save your results in the secure volume storage on your Capsule. This storage option is not visible in maintenance mode.
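
The differences between the two modes can be summarized as a small lookup table. This is a sketch for planning purposes only. The field names are hypothetical, and two entries are inferences rather than statements from this guide: that changes made in maintenance mode persist, and that the secure volume is visible in secure mode.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ModeRules:
    open_network: bool           # free network access, e.g. to install software
    corpus_access: bool          # access to the HTRC corpus repository
    changes_persist: bool        # changes to the Capsule itself are kept
    secure_volume_visible: bool  # secure volume storage can be used


MODES = {
    # changes_persist=True is an inference: software is installed in this mode.
    "maintenance": ModeRules(True, False, True, False),
    # secure_volume_visible=True is an inference: results are saved there.
    "secure": ModeRules(False, True, False, True),
}


def modes_allowing(capability: str):
    """List the modes in which the named capability is available."""
    valid = {f.name for f in fields(ModeRules)}
    if capability not in valid:
        raise ValueError(f"unknown capability: {capability}")
    return [name for name, rules in MODES.items() if getattr(rules, capability)]


if __name__ == "__main__":
    print(modes_allowing("corpus_access"))
    print(modes_allowing("open_network"))
```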

Interacting with a Capsule

Access your Capsule in-browser from HTRC Analytics either by viewing the Remote Desktop (both modes available) or the Terminal command line interface (Maintenance Mode only). Earlier versions of the Capsule environment required a VNC viewer and passwords for both the VNC and the Capsule's operating system; those requirements were removed in the web-based version implemented in August 2018. You can also SSH into your Capsule, in Maintenance Mode only, if you've followed the directions under "Advanced Features" to set up a public key.

To operate your Capsule, click on the Capsule ID from the Capsule list page. Then choose to view either the remote desktop or the terminal. The terminal will work in Maintenance Mode only.

If you've established a key for SSH access, you can also SSH into your Capsule when it's in Maintenance Mode by using the command viewable under "Advanced Features" on an individual Capsule's status page.

Importing data to a Capsule

Use the HTRC Workset Toolkit to import HathiTrust text data into your Capsule. Any outside data you plan to analyze in conjunction with HathiTrust data can be added to your Capsule from a web-accessible location when your machine is in Maintenance Mode.

Learn more: see the HTRC Workset Toolkit page

Passwords

Earlier versions of the Capsule environment required passwords for both the VNC and the Capsule's operating system; those requirements were removed in the web-based version implemented in August 2018. If you use the included HTRC Workset Toolkit inside your Capsule to import data, you will be prompted for your HTRC Analytics username and password.

Generic Research Workflow

  1. Create and start a Capsule in HTRC
  2. View your Capsule using the Remote Desktop view or Terminal view
  3. Configure the software environment of the Capsule as needed. Download the scripts or programs you plan to use in your analysis
  4. Switch the Capsule to secure mode through HTRC
  5. Run your analysis against the secure HTRC corpus repository
  6. Move your results to the secure volume storage on the Capsule
  7. Switch the Capsule back to maintenance mode to regain normal network access


Non-consumptive Exports from an HTRC Data Capsule

Data and tools can easily enter a user's Capsule, but anything leaving a Capsule must undergo review prior to release to the user. The guidelines used during review of the outputs of a Capsule are as follows:

  • Files containing any OCR text or images of pages or volumes are prohibited from leaving a Capsule
  • Binary files are prohibited from leaving a Capsule
  • Encrypted files are prohibited from leaving a Capsule
  • PDF files, or files in any other format that contain images of OCR text or text images, are prohibited from leaving a Capsule
  • For any Capsule results directory that exceeds 1 MB in total, the collection will not be released pending discussion with the Capsule owner

Read the policy: https://www.hathitrust.org/htrc_ncup

A release request must be under 67 MB, and any submitted requests over this size will fail due to technical limitations.

Release requests should include a README text file describing the files included in the request and their data structure.
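
Before submitting a release request, a researcher might run a rough self-check of the results directory against the guidelines above. The sketch below is not the review process HTRC uses, and passing it does not mean a request will be approved. It assumes that 1 MB means 1024 × 1024 bytes, treats any file containing a NUL byte in its first 8 KB as binary, and recognizes PDF files by extension only. It cannot detect encrypted files or files containing OCR text. The function names are hypothetical.

```python
import os

# Assumption: "MB" is interpreted as 1024 * 1024 bytes.
REVIEW_THRESHOLD_BYTES = 1 * 1024 * 1024   # larger directories prompt discussion
REQUEST_LIMIT_BYTES = 67 * 1024 * 1024     # requests must be under this size


def looks_binary(path, sample_size=8192):
    """Heuristic: a NUL byte near the start of a file suggests binary data."""
    with open(path, "rb") as handle:
        return b"\x00" in handle.read(sample_size)


def precheck(results_dir):
    """Return a list of warnings for a results directory; empty means none found."""
    warnings = []
    total_bytes = 0
    has_readme = False
    for root, _dirs, files in os.walk(results_dir):
        for name in sorted(files):
            path = os.path.join(root, name)
            total_bytes += os.path.getsize(path)
            if name.lower().startswith("readme"):
                has_readme = True
            if name.lower().endswith(".pdf"):
                warnings.append(f"{path}: PDF files may be prohibited")
            elif looks_binary(path):
                warnings.append(f"{path}: looks like a binary file")
    if not has_readme:
        warnings.append("no README file describing the included files")
    if total_bytes >= REQUEST_LIMIT_BYTES:
        warnings.append("total size is not under 67 MB; the request would fail")
    elif total_bytes > REVIEW_THRESHOLD_BYTES:
        warnings.append("total size exceeds 1 MB; expect discussion before release")
    return warnings


if __name__ == "__main__":
    import sys
    for warning in precheck(sys.argv[1] if len(sys.argv) > 1 else "."):
        print(warning)
```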