Child pages
  • HTRC Data Capsule Specifications and Usage Guide


This page details the operations and specifications of the HTRC Data Capsules. See the HTRC Data Capsule Tutorial for a more detailed, step-by-step guide to using your capsule.


...

The HTRC Data Capsule is a secure computing environment developed to facilitate non-consumptive text analysis research. Each capsule is a virtual machine (VM) that provides researchers a desktop they can use to perform their investigation of volumes in the HathiTrust Digital Library. 

For further information, check out the FAQs

Follow a detailed tutorial

Using an HTRC Data Capsule

Maintenance vs. Secure Mode

The capsules are configured with special security settings that allow you to interact with them in two modes: maintenance mode and secure mode.

  • In maintenance mode, you are allowed to access the network freely and install whatever software you want. 
  • In secure mode, general network access is restricted, but you can access the HTRC corpus repository, which is otherwise blocked. Any changes you make to the capsule in secure mode will not persist. To save data from your analysis, you'll need to save your results in the secure volume storage on your capsule. This storage option is not visible in maintenance mode. 

Administering a Capsule

Use the HTRC Portal interface to set up your capsule VM and handle administrative tasks for it, including:

  • Create - a capsule is created, but it is not yet running

  • Start - turn the capsule on in maintenance mode

  • Stop - shut down a capsule

  • Delete - the capsule is deleted (including its data and settings) 

  • Switch modes - change the capsule from maintenance to secure mode, or vice-versa

  • See status - view your capsules and their statuses
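The operations above can be read as transitions between capsule states. The following is a minimal sketch of that reading, not the HTRC Portal's actual API: the state names, the `apply_operation` helper, and the assumptions that a capsule can be stopped from either mode and deleted from any state are illustrative only.

```python
# Hypothetical model of the capsule lifecycle described above.
# State and operation names are illustrative, not HTRC Portal identifiers.
TRANSITIONS = {
    ("stopped", "start"): "maintenance",   # a started capsule is in maintenance mode
    ("maintenance", "switch"): "secure",
    ("secure", "switch"): "maintenance",
    ("maintenance", "stop"): "stopped",    # assumption: stop works from either mode
    ("secure", "stop"): "stopped",
    ("stopped", "delete"): "deleted",      # assumption: delete works from any state
    ("maintenance", "delete"): "deleted",
    ("secure", "delete"): "deleted",
}


def create_capsule() -> str:
    """A newly created capsule exists but is not yet running."""
    return "stopped"


def apply_operation(state: str, operation: str) -> str:
    """Return the state that follows `operation`, or raise if it is not allowed."""
    try:
        return TRANSITIONS[(state, operation)]
    except KeyError:
        raise ValueError(f"cannot {operation!r} a capsule that is {state!r}") from None


if __name__ == "__main__":
    state = create_capsule()
    for op in ("start", "switch", "switch", "stop"):
        state = apply_operation(state, op)
        print(f"{op:>6} -> {state}")
```

Under this model, switching modes is only meaningful for a running capsule, which matches the portal showing the switch operation only after a successful start.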


Once a capsule is started via the HTRC Portal interface, you can access your capsule desktop using a VNC client. To run analysis using data from the HTRC corpus repository, you'll need to switch the capsule VM to secure mode. A typical workflow for a new user is listed under Generic Research Workflow.

Interacting with the Capsule Interface

Access and log in to your capsule from your own computer using either a VNC client or SSH. You can log in with a VNC client when the capsule is in either maintenance or secure mode. You can only SSH into the capsule when it is in maintenance mode.

Obtain the requisite information for the capsule under Capsule → Show Capsules → <capsule Id> 

  1. Access the capsule using a VNC client:
    • Install a VNC client on your computer if you do not already have one. There are several you can choose from, including RealVNC, Chicken, and Screen Sharing (on a Mac).
    • Open your VNC client and input the VNC URL.
  2. Access the capsule via SSH:
    • From your Linux terminal, use the following command:
Code Block
ssh -p <your capsule port> dcuser@thatchpalm.pti.indiana.edu 
dcuser@thatchpalm.pti.indiana.edu's password: dcuser


Generic Research Workflow

  1. Create and start a capsule in the HTRC Portal

  2. Log into the capsule using a VNC client

  3. Configure the software environment of the capsule as needed. Download the scripts or programs you plan to use in your analysis

  4. Switch capsule to secure mode through HTRC Portal

  5. Run your analysis against the secure HTRC corpus repository

  6. Move your results to the secure volume storage on the capsule

  7. Switch capsule back to maintenance mode to regain normal network access
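Step 6 matters because changes made elsewhere on the capsule in secure mode do not persist. As a minimal sketch of that step, the helper below moves everything from a results directory into the secure volume. The function name `move_results` and both paths in the usage example are hypothetical; check your capsule for the actual location of its secure volume.

```python
import shutil
from pathlib import Path


def move_results(results_dir, secure_volume):
    """Move every entry in results_dir into secure_volume.

    Refuses to overwrite anything that already exists in the destination.
    Returns the list of destination paths.
    """
    src = Path(results_dir)
    dst_root = Path(secure_volume)
    if not dst_root.is_dir():
        raise FileNotFoundError(f"secure volume not found: {dst_root}")

    moved = []
    for item in sorted(src.iterdir()):
        target = dst_root / item.name
        if target.exists():
            raise FileExistsError(f"refusing to overwrite {target}")
        shutil.move(str(item), str(target))
        moved.append(target)
    return moved


if __name__ == "__main__":
    # Both paths are placeholders, not documented capsule locations.
    for path in move_results("/home/dcuser/results", "/path/to/secure_volume"):
        print("saved", path)
```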

HTRC Data Capsule Configurations

Capsule Technical Specifications

You can set several parameters for your capsule during the creation process:

  • Data Capsule Image: the researcher chooses between two images (versions) of the standard capsule desktop, one that comes pre-loaded with sample volumes from the HathiTrust and one that does not
  • VNC Login User Name: set by the researcher
  • VNC Login Password: set by the researcher
  • Virtual Machine CPUs (VCPUs): the number of virtual processors for the capsule, from 1 to 4, set by the researcher
  • Memory: displayed in megabytes, between 2048 MB (2 GB) and 16000 MB (16 GB)
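To make the limits above concrete, here is a small sketch that checks a requested configuration against them before you fill in the creation form. The function `validate_capsule_request` is hypothetical and not part of the HTRC Portal, and the sketch assumes both ranges are inclusive.

```python
# Limits taken from the parameter list above; inclusive bounds are an assumption.
MIN_VCPUS, MAX_VCPUS = 1, 4
MIN_MEMORY_MB, MAX_MEMORY_MB = 2048, 16000


def validate_capsule_request(vcpus: int, memory_mb: int) -> list:
    """Return a list of human-readable problems; an empty list means the request fits."""
    problems = []
    if not MIN_VCPUS <= vcpus <= MAX_VCPUS:
        problems.append(
            f"VCPUs must be between {MIN_VCPUS} and {MAX_VCPUS}, got {vcpus}"
        )
    if not MIN_MEMORY_MB <= memory_mb <= MAX_MEMORY_MB:
        problems.append(
            f"memory must be between {MIN_MEMORY_MB} and {MAX_MEMORY_MB} MB, got {memory_mb}"
        )
    return problems


if __name__ == "__main__":
    for request in [(2, 4096), (8, 1024)]:
        issues = validate_capsule_request(*request)
        print(request, "ok" if not issues else issues)
```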

Pre-installed Software, Libraries, and Data

Each capsule comes pre-loaded with the following software, libraries, packages, and data. For more details about installed packages, consult the ReadMe file on the desktop of your capsule.

Software


Name         Version   URL                                          Note
Akka         2.4.14    http://akka.io/
Anaconda 3   4.2.0     https://www.continuum.io/anaconda-overview   Supports both Python 2.X and 3.X. See the list below for the pre-installed Python libraries (some installed via Anaconda).
Ant          1.9.7     http://ant.apache.org/
Hadoop       2.7.3     http://hadoop.apache.org/
Mallet       2.0.8     http://mallet.cs.umass.edu/
R            3.3.9     https://www.r-project.org/
Sbt          0.13.13   http://www.scala-sbt.org/
Scala        2.12.1    https://www.scala-lang.org/
Spark        2.0.2     http://spark.apache.org/


Python Libraries 

  • csvkit
  • dask
  • GenSim (currently running with warning)
  • htrc-feature-reader
  • nltk
  • numpy
  • pandas
  • pytables
  • regex
  • scipy
  • theano
  • toolz
  • ujson
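Before switching to secure mode, it can be useful to confirm that the libraries your analysis needs are importable, since software can only be installed in maintenance mode. The sketch below checks for each library without importing it. The helper `missing_modules` is illustrative, and the import names are assumptions where they differ from the names listed above (`gensim` for GenSim, `htrc_features` for htrc-feature-reader, `tables` for pytables).

```python
import importlib.util

# Import names for the libraries listed above (see the assumptions in the text).
EXPECTED_MODULES = [
    "csvkit", "dask", "gensim", "htrc_features", "nltk", "numpy", "pandas",
    "tables", "regex", "scipy", "theano", "toolz", "ujson",
]


def missing_modules(names):
    """Return the names that cannot be located on the current Python path."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    absent = missing_modules(EXPECTED_MODULES)
    if absent:
        print("Not importable:", ", ".join(absent))
    else:
        print("All expected libraries were found.")
```

Run it with the same Python interpreter you plan to use for your analysis, because Anaconda supports more than one Python version.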

System-level Packages

  • Ant
  • curl
  • Git
  • GNU Parallel
  • grep
  • htop
  • iotop
  • Java 8
  • jq
  • less
  • Maven
  • pcregrep
  • R
  • rsync
  • Scala 2.11.6
  • SBT
  • Spark
  • vim
  • zsh
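A similar check can be sketched for the system-level packages by looking for their executables on the `PATH`. The helper `missing_commands` is illustrative, and the command names are assumptions about how each package is invoked (for example `parallel` for GNU Parallel, `mvn` for Maven, and `spark-shell` for Spark); adjust them to match your capsule.

```python
import shutil

# Assumed executable names for the packages listed above.
EXPECTED_COMMANDS = [
    "ant", "curl", "git", "parallel", "grep", "htop", "iotop", "java", "jq",
    "less", "mvn", "pcregrep", "R", "rsync", "scala", "sbt", "spark-shell",
    "vim", "zsh",
]


def missing_commands(commands):
    """Return the commands that shutil.which cannot find on the PATH."""
    return [cmd for cmd in commands if shutil.which(cmd) is None]


if __name__ == "__main__":
    absent = missing_commands(EXPECTED_COMMANDS)
    if absent:
        print("Not found on PATH:", ", ".join(absent))
    else:
        print("All expected commands were found.")
```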

Sample data and programs

  • 3 sample HTRC worksets of 1000 volumes each: U.S. Government Documents, German language volumes, 19th Century English Literature. 
  • Topic Explorer: http://inphodata.cogs.indiana.edu/

HTRC Data Capsule operations

Please note: You are required to log in to the HTRC Portal before you can perform these operations.

Create a capsule virtual machine (VM)

  • Navigate to the “Create Virtual Machine” tab and fill in the form.
    • Choose an image from the drop-down list
    • Provide a username and password for the VNC session
    • Choose the number of CPUs and the amount of memory you want your virtual machine to have
  • Click the "Create VM" button. The VM creation procedure usually takes about 1 minute to finish. You can refresh your screen to see if it has completed.


Show capsule VM status

  • Navigate to the “Virtual Machines” page
  • You can see all the capsule VMs associated with your account and the actions you can perform on them


  • Click on the VM id link to see more details about the capsule VM.
  • The “VM Initial Logging User ID” and “VM Initial Logging Password” are the username and password you use to log into the capsule VM 
    • NOTE: These are different from the ones you use to open your VNC session to the VM
  • The “Public IP” and “VNC port” are the details you need to open a VNC session
  • You can also use the “Public IP” and “SSH port” to log in to the VM through SSH, but this is only allowed in maintenance mode.


Start a capsule VM

  • Start a capsule VM on the “Virtual Machine” page. This operation usually takes 2-3 minutes, and you can refresh your screen to see if it has finished.
  • Once the capsule VM starts successfully, you can see more available operations, including switch, stop, and delete.


Log into a capsule VM

  • Use your preferred VNC client, e.g. Google VNC Client, to connect to the capsule VM. You will need to provide the VNC password, as well as the VM login username and login password.

Switch modes of a capsule VM

  • By default, the capsule starts in maintenance mode, in which you have network access.
  • To switch to secure mode, click the “Switch to Security” button.
  • Once you click to switch modes, the screen in your capsule will be frozen for a short time. Once the switch is complete, you'll be able to resume your work.
  • Before switching from secure mode to maintenance mode, eject/unmount the secure volume to ensure that any changes made to it are made permanent.
  • In the HTRC Portal, go to "HTRC Data Capsule -> Show Virtual Machines" to find the page for switching between modes. In maintenance mode, click the "Switch To Secure Mode" button in the portal to switch to secure mode.
  • In secure mode, click the "Switch To Maintenance Mode" button in the HTRC Portal to switch to maintenance mode.


Stop a Virtual Machine

  • Stop a capsule VM by clicking the “Stop VM” button. The capsule VM then shuts down, and everything inside the VM is preserved.


Restart a Virtual Machine

  • Although we do not provide a reset button for restarting the VM directly, you can always stop the VM and then start it again. This has the same effect as pushing a reset button on a machine.

Delete a Virtual Machine

  • You can delete a capsule VM by clicking the “Delete VM” button. The capsule is then cleared, including its settings and any data on it.

...