
Overview

The HTRC Data Capsule is a secure computing environment developed to facilitate non-consumptive text analysis research. Each capsule is a virtual machine (VM) that provides researchers a desktop they can use to perform their investigation of volumes in the HathiTrust Digital Library. 

For further information, see the FAQs.

Using an HTRC Data Capsule

Detailed tutorial

For step-by-step instructions, follow a tutorial or user guide.

Administering a Capsule

Use the HTRC site to handle administrative tasks for your capsule:

  • Create - a capsule is created, but it is not yet running

  • Start - turn the capsule on in maintenance mode

  • Stop - shut down the capsule

  • Delete - the capsule is deleted (including its data and settings) 

  • Switch modes - change the capsule from maintenance to secure mode, or vice-versa (see below)

  • See status - view your capsules and their statuses

Maintenance vs. Secure Mode

The capsules are configured with special security settings that allow you to interact with them in two modes: maintenance mode and secure mode.

  • In maintenance mode, you are allowed to access the network freely and install whatever software you want. 
  • In secure mode, general network access is restricted, but you can access the HTRC corpus repository, which is otherwise blocked. Changes you make to the capsule in secure mode do not persist. To keep the output of your analysis, save your results to the secure volume storage on your capsule. This storage is not visible in maintenance mode.


Interacting with the Capsule

Access and log in to your capsule from your desktop using either a VNC client (Screen Sharing on a Mac, for example) or SSH. You can log in with a VNC client when the capsule is in either maintenance or secure mode. You can only SSH into the capsule when it is in maintenance mode.

Obtain the requisite information for the capsule under Capsule → Show Capsules → <capsule Id> 

  1. Access the capsule using a VNC client:
    • Install a VNC client on your computer if you do not already have one. There are several to choose from, including RealVNC, Chicken, and Screen Sharing (on a Mac).
    • Open your VNC client and enter the VNC URL.
  2. Access the capsule via SSH:
    • From your Linux terminal, run the following command and enter the password when prompted:
ssh -p <your capsule port> dcuser@thatchpalm.pti.indiana.edu 
dcuser@thatchpalm.pti.indiana.edu's password: dcuser

Passwords

Three sets of credentials are important for using a capsule:

  • The HTRC Analytics username and password that you use to log in to analytics.hathitrust.org. The web interface is where you switch your capsule's mode and start and stop (shut down) your capsule.
  • The VNC username and VNC password you set when you created your capsule. You will be prompted for this password when you log in to your capsule using the VNC client or SSH.
  • The generic Ubuntu username and password, which is dcuser for both. You will need to enter this password to access the operating system once your capsule is started in the VNC client. 

Generic Research Workflow

  1. Create and start a capsule in the HTRC

  2. Log into the capsule using a VNC client

  3. Configure the software environment of the capsule as needed. Download the scripts or programs you plan to use in your analysis

  4. Switch capsule to secure mode through HTRC

  5. Run your analysis against the secure HTRC corpus repository

  6. Move your results to the secure volume storage on the capsule

  7. Switch capsule back to maintenance mode to regain normal network access

HTRC Data Capsule Configurations

Capsule Technical Specifications

You can set several parameters for your capsule during the creation process:

  • Data Capsule Image: the researcher chooses between two images (versions) of the standard capsule desktop, one pre-loaded with sample volumes from HathiTrust and one without
  • VNC Login User Name: set by the researcher
  • VNC Login Password: set by the researcher
  • Virtual Machine CPUs (VCPUs): the number of virtual processors for the capsule, from 2 to 4, set by the researcher
  • Memory: displayed in megabytes, between 4096 MB (4 GB) and 16000 MB (16 GB)
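
As a rough illustration of the ranges listed above, the following Python sketch checks a requested configuration before you fill in the creation form. The function and constant names are hypothetical and are not part of any HTRC tool; only the numeric limits come from this page.

```python
# Limits copied from the parameter list above.
# All names here are illustrative and not part of any HTRC API.
MIN_VCPUS, MAX_VCPUS = 2, 4
MIN_MEMORY_MB, MAX_MEMORY_MB = 4096, 16000


def validate_capsule_request(vcpus, memory_mb):
    """Return a list of problems with the requested settings (empty if none)."""
    problems = []
    if not MIN_VCPUS <= vcpus <= MAX_VCPUS:
        problems.append(
            "VCPUs must be between {} and {}".format(MIN_VCPUS, MAX_VCPUS))
    if not MIN_MEMORY_MB <= memory_mb <= MAX_MEMORY_MB:
        problems.append(
            "memory must be between {} and {} MB".format(
                MIN_MEMORY_MB, MAX_MEMORY_MB))
    return problems
```

Calling validate_capsule_request(2, 4096) returns an empty list, while a request outside either range returns one message per violated limit.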

User Quotas

Each user in the Data Capsule environment has an overall disk quota, a memory quota, and a CPU quota. One user can consume up to 100 GB of disk space, ~20 GB of memory, and 10 CPUs. If creating an additional capsule would push you over any of these quotas, you will encounter an error.
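
The quota arithmetic can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "~20 GB" memory quota is taken as 20 × 1024 MB, disk is left out because this page does not state how much disk each capsule uses, and the function name is hypothetical.

```python
# Per-user quotas from the paragraph above.
USER_CPU_QUOTA = 10
# "~20 GB" in the text; treating it as 20 * 1024 MB is an assumption.
USER_MEMORY_QUOTA_MB = 20 * 1024


def fits_in_quota(existing, new):
    """Check whether one more capsule stays within the CPU and memory quotas.

    existing: list of (vcpus, memory_mb) tuples for capsules you already have
    new:      (vcpus, memory_mb) tuple for the capsule you want to create
    """
    total_cpus = sum(cpus for cpus, _ in existing) + new[0]
    total_memory = sum(mem for _, mem in existing) + new[1]
    return (total_cpus <= USER_CPU_QUOTA
            and total_memory <= USER_MEMORY_QUOTA_MB)
```

For example, two capsules with 4 VCPUs and 8192 MB each fit (8 CPUs, 16384 MB), but adding a third of the same size would not, because it would need 12 CPUs.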

Pre-installed Software, Libraries, and Data

Each capsule comes pre-loaded with the following software, libraries, and data. For more details about installed packages, consult the ReadMe file on the desktop of your capsule.

Software

Name         Version   URL                                          Note
Akka         2.4.14    http://akka.io/
Anaconda 3   4.2.0     https://www.continuum.io/anaconda-overview   Supports both Python 2.X and 3.X. See list below for the Python libraries pre-installed (some via Anaconda)
Ant          1.9.7     http://ant.apache.org/
Hadoop       2.7.3     http://hadoop.apache.org/
Mallet       2.0.8     http://mallet.cs.umass.edu/
R            3.3.9     https://www.r-project.org/
Sbt          0.13.13   http://www.scala-sbt.org/
Scala        2.12.1    https://www.scala-lang.org/
Spark        2.0.2     http://spark.apache.org/

Python Libraries 

  • csvkit
  • dask
  • GenSim (currently running with warning)
  • htrc-feature-reader
  • nltk
  • numpy
  • pandas
  • pytables
  • regex
  • scipy
  • theano
  • toolz
  • ujson

Sample data and programs

  • 3 sample HTRC worksets of 1000 volumes each: U.S. Government Documents, German language volumes, 19th Century English Literature. 
  • Topic Explorer: http://inphodata.cogs.indiana.edu/

Non-consumptive Exports from an HTRC Data Capsule

Data and tools can easily enter a user's capsule, but anything leaving a capsule must undergo review prior to release to the user. The guidelines used during review of the outputs of a capsule are as follows:

  • Files containing any OCR text or images of pages or volumes are prohibited from leaving a capsule.
  • Binary files are prohibited from leaving a capsule.
  • Encrypted files are prohibited from leaving a capsule.
  • PDF files, or files in any other format that contain page images or images of text, are prohibited from leaving a capsule.
  • Any capsule results directory that exceeds 1 MB in total will not be released until it has been discussed with the capsule owner.
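
Before requesting a release, you might run a rough self-check on your results directory. The sketch below is not the review procedure HTRC uses. It covers only the mechanical parts of the guidelines (PDF files, likely binary files, total size), treats "1 MB" as 1024 × 1024 bytes, and uses a simple NUL-byte heuristic to guess whether a file is binary. All names are hypothetical, and it cannot detect OCR text or encrypted content.

```python
import os

# "1 MB" from the guidelines; interpreting it as 1 MiB is an assumption.
MAX_RESULTS_BYTES = 1024 * 1024


def looks_binary(path, sample_size=8192):
    """Heuristic: treat a file as binary if its first bytes contain a NUL."""
    with open(path, "rb") as handle:
        return b"\x00" in handle.read(sample_size)


def precheck_results(directory):
    """Return a list of warnings for files that would likely fail review."""
    warnings = []
    total_bytes = 0
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            total_bytes += os.path.getsize(path)
            if name.lower().endswith(".pdf"):
                warnings.append("{}: PDF files are not released".format(path))
            elif looks_binary(path):
                warnings.append("{}: looks like a binary file".format(path))
    if total_bytes > MAX_RESULTS_BYTES:
        warnings.append(
            "total size {} bytes exceeds 1 MB".format(total_bytes))
    return warnings
```

An empty list means none of these mechanical checks fired; the output is still subject to the actual review.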

 

 
