Overview

The HTRC Data Capsule is a secure computing environment developed to facilitate non-consumptive text analysis research. Each capsule is a virtual machine (VM) that gives researchers a desktop from which to investigate volumes in the HathiTrust Digital Library.

For further information, check out the FAQs.

Follow a detailed tutorial

Using an HTRC Data Capsule

Maintenance vs. Secure Mode

The capsules are configured with special security settings that allow you to interact with them in two modes: maintenance mode and secure mode.

Administering a Capsule

Use the HTRC Portal interface to handle administrative tasks for your capsule:


Interacting with the Capsule Interface

Access and log in to your capsule from your desktop using either a VNC client or SSH. You can log in with a VNC client when the capsule is in either maintenance or secure mode. You can SSH into the capsule only when it is in maintenance mode, because secure mode restricts normal network access.

Obtain the requisite information for the capsule under Capsule → Show Capsules → <capsule Id> 

  1. Access the capsule using a VNC client:
  2. Access the capsule via SSH:

       ssh -p <your capsule port> dcuser@thatchpalm.pti.indiana.edu
       dcuser@thatchpalm.pti.indiana.edu's password: dcuser
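As a minimal sketch of the SSH step above, the helper below assembles the command line from a capsule port and rejects values that cannot be a TCP port. The host and user name are taken from the example on this page; the function name `capsule_ssh_command` and the port `16000` are made up for illustration. Use the port shown for your own capsule in the HTRC Portal.

```shell
#!/bin/sh
# Sketch only: capsule_ssh_command is a hypothetical helper, not an HTRC tool.
# Host and user are copied from the SSH example on this page.
CAPSULE_HOST="thatchpalm.pti.indiana.edu"
CAPSULE_USER="dcuser"

# Print the ssh command for the given capsule port (does not run it).
capsule_ssh_command() {
    capsule_port="$1"
    # Reject empty input, non-digits, and anything longer than five digits.
    case "$capsule_port" in
        ''|*[!0-9]*|??????*)
            echo "error: port must be a number between 1 and 65535" >&2
            return 1
            ;;
    esac
    # Reject values outside the valid TCP port range.
    if [ "$capsule_port" -lt 1 ] || [ "$capsule_port" -gt 65535 ]; then
        echo "error: port must be a number between 1 and 65535" >&2
        return 1
    fi
    printf 'ssh -p %s %s@%s\n' "$capsule_port" "$CAPSULE_USER" "$CAPSULE_HOST"
}

# Example (16000 is a made-up port):
#   capsule_ssh_command 16000
```

Printing the command instead of executing it lets you check the port and host before connecting.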


Generic Research Workflow

  1. Create and start a capsule in the HTRC Portal

  2. Log in to the capsule using a VNC client

  3. Configure the software environment of the capsule as needed. Download the scripts or programs you plan to use in your analysis

  4. Switch the capsule to secure mode through the HTRC Portal

  5. Run your analysis against the secure HTRC corpus repository

  6. Move your results to the secure volume storage on the capsule

  7. Switch the capsule back to maintenance mode to regain normal network access
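Step 6 of the workflow above could be scripted along the following lines. This is a sketch under stated assumptions: `stage_results` is a hypothetical helper, and because this page does not give the mount point of the secure volume, the destination directory is passed in as an argument rather than hard-coded.

```shell
#!/bin/sh
# Sketch only: stage_results is a hypothetical helper, not part of the capsule.
# Usage: stage_results <results_dir> <secure_volume_dir>
# Moves every regular file at the top level of <results_dir> into
# <secure_volume_dir>; subdirectories and hidden files are left alone.
stage_results() {
    results_dir="$1"
    secure_dir="$2"
    if [ ! -d "$results_dir" ]; then
        echo "error: results directory not found: $results_dir" >&2
        return 1
    fi
    if [ ! -d "$secure_dir" ] || [ ! -w "$secure_dir" ]; then
        echo "error: destination is not a writable directory: $secure_dir" >&2
        return 1
    fi
    moved_count=0
    for result_file in "$results_dir"/*; do
        # Skip directories and the unexpanded glob of an empty directory.
        [ -f "$result_file" ] || continue
        mv -- "$result_file" "$secure_dir"/ || return 1
        moved_count=$((moved_count + 1))
    done
    echo "moved $moved_count file(s) to $secure_dir"
}

# Example (both paths are placeholders):
#   stage_results "$HOME/results" /path/to/secure/volume
```

Checking that the destination exists and is writable before moving anything avoids leaving results half-moved if the secure volume is not available.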

HTRC Data Capsule Configurations

Capsule Technical Specifications

You can set several parameters for your capsule during the creation process.

Pre-installed Software, Libraries, and Data

Each capsule comes pre-loaded with the following software, libraries, and data. For more details about installed packages, consult the ReadMe file on the desktop of your capsule.

Software


Name        Version  URL                                         Note
Akka        2.4.14   http://akka.io/
Anaconda 3  4.2.0    https://www.continuum.io/anaconda-overview  Supports both Python 2.X and 3.X. See list below for the Python libraries pre-installed (some via Anaconda)
Ant         1.9.7    http://ant.apache.org/
Hadoop      2.7.3    http://hadoop.apache.org/
Mallet      2.0.8    http://mallet.cs.umass.edu/
R           3.3.9    https://www.r-project.org/
Sbt         0.13.13  http://www.scala-sbt.org/
Scala       2.12.1   https://www.scala-lang.org/
Spark       2.0.2    http://spark.apache.org/
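One way to confirm that the command-line tools listed above are reachable from a capsule terminal is sketched below. `check_tools` is a hypothetical helper, and the command names in the example are assumptions about how each package is usually launched. This page does not say which commands are on the PATH, and Akka is a library with no command of its own.

```shell
#!/bin/sh
# Sketch only: check_tools is a hypothetical helper, not part of the capsule.
# Prints one line per command name and returns non-zero if any are missing.
check_tools() {
    missing_count=0
    for tool_name in "$@"; do
        # command -v is the POSIX way to look a command up on the PATH.
        if tool_path="$(command -v "$tool_name" 2>/dev/null)"; then
            printf '%-14s found at %s\n' "$tool_name" "$tool_path"
        else
            printf '%-14s missing\n' "$tool_name"
            missing_count=$((missing_count + 1))
        fi
    done
    [ "$missing_count" -eq 0 ]
}

# Example (command names are assumptions, adjust to your capsule):
#   check_tools conda python ant hadoop R sbt scala spark-submit
```

A tool reported as missing may still be installed outside the PATH; the ReadMe file on the capsule desktop is the place to check in that case.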


Python Libraries 

Sample data and programs

Non-consumptive Exports from an HTRC Data Capsule

Data and tools can easily enter a user's capsule, but anything leaving a capsule must undergo review prior to release to the user. The guidelines used during review of the outputs of a capsule are as follows: