
This page documents the basic operations on Data Capsule VMs: creating, starting, stopping, and deleting a VM, switching between modes, and querying VM status. See HTRC Data Capsule Tutorial (current version) for a more detailed, step-by-step tutorial on the Data Capsule.


1. Overview

This document is a starting point for users working with our non-consumptive virtual machine (VM). A non-consumptive VM has two modes: maintenance mode and secure mode. In maintenance mode, the user can access the network freely, except for the HTRC corpus repository, and can install whatever software they want. In secure mode, network access is restricted: users can reach only a few network addresses, e.g., the HTRC corpus repository. In addition, any changes the user makes to the OS in secure mode are not persisted. To save data, the user needs to write to a designated storage area called the secure volume. The secure volume is invisible in maintenance mode.

We provide a user-friendly web interface through which users can manage VMs running on our back-end infrastructure. The following VM operations are available:

  • Create a VM. The virtual machine is created but powered off.

  • Start a VM. The virtual machine starts up in maintenance mode.

  • Stop a VM. The virtual machine shuts down.

  • Delete a VM. The virtual machine is deleted. Everything related to this virtual machine is gone.

  • Switch a VM from maintenance mode to secure mode or vice versa.

  • Query VM status.

Once a VM is started, users can log into the VM through a VNC client. To run analysis against the HTRC OCR repository, users need to switch the VM to secure mode. Below is a typical workflow a new user may follow.

  1. Create a new VM;

  2. Start the new VM;

  3. Log into the VM;

  4. Configure the software environment as needed. Upload the analysis program to the VM;

  5. Switch the VM to secure mode through the web interface;

  6. Run the analysis program against the HTRC corpus repository;

  7. Switch the VM back to maintenance mode to regain normal network access.
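
The operations and workflow above can be read as a small state machine. The following Python sketch is illustrative only: the state names, the operation strings, and the choice to allow stop and delete from either running mode are assumptions made for the example, not part of the HTRC web service.

```python
from enum import Enum


class State(Enum):
    NONE = "none"                # no VM exists (before create, after delete)
    OFF = "off"                  # VM exists but is powered off
    MAINTENANCE = "maintenance"  # running, normal network access
    SECURE = "secure"            # running, restricted network access


# (current state, operation) -> next state.
# Assumption: stop and delete are allowed from both running modes.
TRANSITIONS = {
    (State.NONE, "create"): State.OFF,
    (State.OFF, "start"): State.MAINTENANCE,
    (State.MAINTENANCE, "switch"): State.SECURE,
    (State.SECURE, "switch"): State.MAINTENANCE,
    (State.MAINTENANCE, "stop"): State.OFF,
    (State.SECURE, "stop"): State.OFF,
    (State.OFF, "delete"): State.NONE,
    (State.MAINTENANCE, "delete"): State.NONE,
    (State.SECURE, "delete"): State.NONE,
}


def apply(state, operation):
    """Return the state that follows `operation`, or raise ValueError."""
    try:
        return TRANSITIONS[(state, operation)]
    except KeyError:
        raise ValueError(
            f"cannot {operation} a VM in state {state.value}"
        ) from None


def run(operations, state=State.NONE):
    """Apply a sequence of operations, starting with no VM by default."""
    for operation in operations:
        state = apply(state, operation)
    return state


if __name__ == "__main__":
    # The typical workflow: create, start, switch to secure, switch back.
    print(run(["create", "start", "switch", "switch"]))
```

A newly created VM is powered off, so in this sketch a "switch" issued before "start" raises an error rather than silently doing nothing.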

2. Virtual Machine Operations

This section covers all the operations you can perform on the virtual machine. You must log in to the HTRC portal before you can perform these operations.

2.1 Create a Virtual Machine

Navigate to the “Create Virtual Machine” tab and fill in the form. Choose an image from the drop-down list, provide a username and password for the VNC session, and choose the number of CPUs and the amount of memory for your virtual machine. Finally, click the "Create VM" button. VM creation usually takes about 1 minute.


2.2 Show a Virtual Machine Status

Navigate to the “Virtual Machines” page to see all of your VMs and the operations available for each.


You can click on the VM ID link to see more details about the VM. The “VM Initial Logging User ID” and “VM Initial Logging Password” are the username and password you use to log into the VM. These are different from the ones you use to open your VNC session to the VM. The “Public IP” and “VNC port” are the information you need to open a VNC session. You can also use the “Public IP” and “SSH port” to log into the VM through SSH, but this is only allowed in maintenance mode.


2.3 Start a Virtual Machine

You can start a virtual machine from the “Virtual Machine” page. This operation usually takes 2 to 3 minutes. Once the VM starts successfully, more operations become available, e.g., switch, stop, and delete.


2.4 Log into a Virtual Machine

You can use your favorite VNC client, e.g., Google VNC Client, to connect to the VM by providing the VNC password as well as the VM login username and password.

2.5 Switch Virtual Machine Mode

By default, the VM starts in maintenance mode, where you have network access. To switch to secure mode, click the “Switch to Security” button. When you perform the mode switch, your screen within the VNC session freezes for a short time. After that, you can resume your work. Before switching from secure mode to maintenance mode, make sure you eject/unmount the secure volume to ensure that any changes made to the secure volume are made permanent.

In the portal, go to "HTRC Data Capsule -> Show Virtual Machines" to reach the page for switching between modes. In maintenance mode, click the "Switch To Secure Mode" button in the portal to switch to secure mode.


In secure mode, click the "Switch To Maintenance Mode" button in the portal to switch to maintenance mode.

2.6 Stop a Virtual Machine

You can stop a VM by clicking the “Stop VM” button. The VM then shuts down, and everything inside the VM is preserved.


2.7 Restart a Virtual Machine

Although we do not provide a reset button for restarting the VM directly, you can always stop the VM and then start it again. This has the same effect as pushing a reset button on a machine.

2.8 Delete a Virtual Machine

You can delete a VM by pushing the “Delete VM” button. After that, the VM is wiped out and everything inside the VM is gone.


3. Resources

1) The source code

The code base has 3 parts: a web GUI, a web service, and back-end scripts.

2) The web interface

You can find the URL for the web front end on the HTRC production portal: https://htrc2.pti.indiana.edu

...





HTRC Data Capsules are secure computing environments developed to facilitate non-consumptive text analysis research. Each Capsule is a virtual machine (VM) that provides researchers a desktop they can use to perform their investigation of volumes in the HathiTrust Digital Library. 


Use a Capsule: https://analytics.hathitrust.org/staticcapsules
Follow a tutorial: HTRC Data Capsule Step-by-Step Tutorials



Kinds of Capsules: Demo Capsules and Research Capsules

During creation, choose between a Demo Capsule, for testing and experimenting with the interface, or a Research Capsule, for conducting research. 

Demo
  • Capsule comes pre-loaded with sample volumes from the HathiTrust 
  • No options for Capsule size or specs
  • Access to the public domain corpus only
  • Results cannot be submitted for review and release
  • No additional information required to create
  • Cannot be shared with collaborators
  • Expires after 30 days
Research
  • Option for Capsule to come pre-loaded with sample volumes from the HathiTrust 
  • User can set the Capsule size (see 'Configuration options for Research Capsules' below)
  • By default, access to the public domain corpus only
  • All Research Capsules require additional information at creation to aid in results export requests. Only requests to create or convert to a Capsule with full corpus access are subject to additional screening (as described above).
  • Can be shared with up to 5 collaborators.
  • Members-only Benefit: full corpus access for the Data Capsule service. Existing Data Capsule users from HathiTrust member institutions or new Data Capsule requesters from member institutions have the exclusive option to select “Full Corpus Access,” which includes copyrighted items.

  • Expires 18 months from your last log-in date
  • Configuration options for Research Capsules:
    • Data Capsule Image: there are two images (versions) of the standard Capsule desktop, one that comes pre-loaded with sample volumes from the HathiTrust and one that does not
    • Virtual Machine CPUs (VCPUs): the number of virtual processors for the Capsule, from 2 to 4 VCPUs
    • Memory: between 4GB and 16GB
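
As an illustration of the Research Capsule configuration limits listed above, the following Python sketch checks a requested size against the stated ranges (2 to 4 VCPUs, 4 GB to 16 GB of memory). The function name and its return format are hypothetical; the actual creation form in HTRC Analytics performs its own validation.

```python
# Limits taken from the configuration options listed above.
VCPU_RANGE = (2, 4)
MEMORY_GB_RANGE = (4, 16)


def validate_capsule_config(vcpus, memory_gb):
    """Return a list of problems with the requested size (empty if valid)."""
    problems = []
    if not VCPU_RANGE[0] <= vcpus <= VCPU_RANGE[1]:
        problems.append(
            f"VCPUs must be between {VCPU_RANGE[0]} and {VCPU_RANGE[1]}, "
            f"got {vcpus}"
        )
    if not MEMORY_GB_RANGE[0] <= memory_gb <= MEMORY_GB_RANGE[1]:
        problems.append(
            f"memory must be between {MEMORY_GB_RANGE[0]} GB and "
            f"{MEMORY_GB_RANGE[1]} GB, got {memory_gb} GB"
        )
    return problems


if __name__ == "__main__":
    print(validate_capsule_config(vcpus=4, memory_gb=8))   # valid request
    print(validate_capsule_config(vcpus=8, memory_gb=32))  # two problems
```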

(info) Watch videos about creating Demo and Research Capsules here.





Capsule Access and Interaction Modes

Maintenance vs. Secure Mode

The Capsules are configured with special security settings that allow you to interact with them in two modes: maintenance mode and secure mode.

  • In maintenance mode, you are allowed to access the network freely and install whatever software you want. 
  • In secure mode, general network access is restricted, but you can access the HTRC corpus repository, which is otherwise blocked. Any changes you make to the Capsule in secure mode will not persist. To save data from your analysis, you'll need to save your results in the secure volume storage on your Capsule. This storage option is not visible in maintenance mode.  

(info) Watch a video on switching modes in a Data Capsule here.

Ways of accessing Capsules

Access your Capsule in-browser from HTRC Analytics either by viewing the Remote Desktop (both modes available) or the Terminal command line interface (Maintenance Mode only). Earlier versions of the Capsule environment required a VNC viewer and passwords for both the VNC and the Capsule's operating system; those requirements were removed in the web-based version that was implemented in August 2018.

You can also SSH into your Capsule, in Maintenance Mode only, provided you have followed the directions under "Advanced Features" to set up a public key. 

For a detailed explanation of how to create, access, and operate your Capsule, please visit the "HTRC Data Capsule Tutorial" page for step-by-step instructions. 


Capsule Sharing Functions

Research Data Capsules can be shared between up to 5 collaborators. The person who creates the Capsule has the most control over it, and they can add and remove other collaborators, assign permissions, and delete the Capsule.

There are 3 roles for users of a shared Capsule:

  • Owner (and Owner-Controller): By default each Capsule creator will get this role. It comes with the highest level of control. The Owner-Controller is able to perform all Capsule functions available in HTRC Analytics, including accessing, starting, stopping, switching modes, deleting, and managing the collaborators on the Capsule. By default the person who creates the Capsule will be the Owner-Controller until they delegate control of the Capsule to a collaborator, at which point their role becomes Owner and the ability to start, stop, and switch the modes of the Capsule moves to the Controller (see below). The Owner can resume Owner-Controller status whenever they choose.
  • Contributor: The Owner can share their Capsule with other HTRC Analytics users. New collaborators have the role of Contributor when they are added. This role has the lowest permission level. Contributors can connect to and conduct research in the Capsule, but cannot perform any of the Capsule management functions.
  • Controller: The Owner can choose to give a Contributor the status of Controller in order to delegate some management tasks of the Capsule to that user, including starting, stopping, and switching modes. There can only be one Controller at a time, and the Owner can revoke control of the Capsule at any time.
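
The role descriptions above amount to a permission matrix. The Python sketch below is one reading of that text, not the HTRC Analytics implementation: the role keys and action names are made up for the example, and it assumes that an Owner who has delegated control keeps access, deletion, and collaborator management.

```python
# Hypothetical action names; the roles follow the list above.
PERMISSIONS = {
    "owner-controller": {
        "access", "start", "stop", "switch_mode",
        "delete", "manage_collaborators",
    },
    # Assumption: after delegating control, the Owner keeps everything
    # except starting, stopping, and switching modes.
    "owner": {"access", "delete", "manage_collaborators"},
    "controller": {"access", "start", "stop", "switch_mode"},
    "contributor": {"access"},
}


def can(role, action):
    """Return True if `role` is allowed to perform `action`."""
    return action in PERMISSIONS.get(role, set())


if __name__ == "__main__":
    for role in PERMISSIONS:
        print(role, "may switch modes:", can(role, "switch_mode"))
```

An unknown role is treated as having no permissions, which keeps the check conservative.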

Once a collaborator is added to a Capsule, the Capsule will appear for them on their Capsules listing page in HTRC Analytics. Before the new collaborator can access the Capsule, they will need to agree to the Data Capsules Terms of Use.

For Capsules with full-corpus access, HTRC will review the request to add a collaborator and either approve or deny it. The Capsule details will only appear on their Capsules listing page if the request is approved.

As they are meant for short-term exploration, demo capsules cannot be shared with collaborators.

(info) Watch a video on adding collaborators to a Data Capsule here.


Requesting Help from your Data Capsule

To receive help from HTRC staff with any problem you are experiencing with your data capsule, you may submit a 'Request Data Capsule Help' form directly from the data capsule you are working in. Find the link to this form in the 'More Data Capsule Functions' dropdown menu located on the right-hand side of your data capsule's status page. 


Pre-installed Software, Libraries, and Data

Each Capsule comes pre-loaded with the following software, libraries, and data. For more details about installed packages, consult the ReadMe file on the desktop of your Capsule.

Software

Name | Version | Note
Akka | 2.4.14 |
Anaconda 3 | 4.2.0 | Supports both Python 2.X and 3.X. See list below for the Python libraries pre-installed (some via Anaconda)
Ant | 1.9.7 |
Hadoop | 2.7.3 |
InPho Topic Explorer | | Project website: https://www.hypershelf.org/
Mallet | 2.0.8 |
R | 3.3.9 |
Sbt | 0.13.13 |
Scala | 2.12.1 |
Spark | 2.0.2 |
Voyant Tools | |

Python Libraries

HTRC-developed

  • htrc-feature-reader (learn more: https://github.com/htrc/htrc-feature-reader)

  • htrc workset toolkit (learn more: HTRC Workset Toolkit)

General

  • csvkit
  • dask
  • GenSim (currently running with warning)
  • nltk
  • numpy
  • pandas
  • pytables
  • regex
  • scipy
  • theano
  • toolz
  • ujson

Sample data

  • 3 sample HTRC worksets of 1000 volumes each: U.S. Government Documents, German language volumes, 19th Century English Literature. 


Data Imports and Exports

Importing data into an HTRC Data Capsule

Use the HTRC Workset Toolkit to import HathiTrust text data into your Capsule. Any outside data you plan to analyze in conjunction with HathiTrust data can be added to your Capsule from a web-accessible location when your machine is in Maintenance Mode.

Learn more: HTRC Workset Toolkit

Non-consumptive Exports from an HTRC Data Capsule

Data and tools can easily enter a user's Capsule, but anything leaving a Capsule must undergo review prior to release to the user. The guidelines used during review of the outputs of a Capsule are as follows:

  • Files containing any OCR text or images of pages or volumes are prohibited from leaving a Capsule.
  • Binary files are prohibited from leaving a Capsule.
  • Encrypted files are prohibited from leaving a Capsule.
  • PDF files, or files in any other format, that contain images of OCR text or images of text are prohibited from leaving a Capsule.
  • For any Capsule results directory that exceeds 1 MB in total, the collection will not be released until it has been discussed with the Capsule owner.
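
Some of the guidelines above can be roughly pre-checked before submitting a request. The Python sketch below is a heuristic illustration only and is not the review HTRC staff perform: it flags PDFs by file extension, guesses that a file is binary if its first bytes contain a NUL byte, and assumes 1 MB means 1,048,576 bytes. It cannot detect OCR text or encrypted content, and the function names are hypothetical.

```python
from pathlib import Path

# Assumption: "1 MB" is interpreted as 1024 * 1024 bytes.
SIZE_FLAG_BYTES = 1024 * 1024


def looks_binary(path, sample_size=8192):
    """Heuristic: treat a file as binary if its first bytes contain NUL."""
    with open(path, "rb") as handle:
        return b"\x00" in handle.read(sample_size)


def precheck(results_dir):
    """Return warnings for files that are likely to be held back at review."""
    warnings = []
    total_bytes = 0
    for path in sorted(Path(results_dir).rglob("*")):
        if not path.is_file():
            continue
        total_bytes += path.stat().st_size
        if path.suffix.lower() == ".pdf":
            warnings.append(f"{path.name}: PDF files may be prohibited")
        elif looks_binary(path):
            warnings.append(f"{path.name}: looks like a binary file")
    if total_bytes > SIZE_FLAG_BYTES:
        warnings.append(
            f"results total {total_bytes} bytes, above the 1 MB threshold "
            "that triggers a discussion with the Capsule owner"
        )
    return warnings


if __name__ == "__main__":
    for warning in precheck("."):
        print(warning)
```

An empty list from this check does not mean a request will be approved; the deciding question remains whether the export could substitute for reading the original text.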

Read the policy: https://www.hathitrust.org/htrc_ncup

The general rule of thumb is whether the export would create a substitute for a human reading the original text. (The full Non-Consumptive Use Research Policy is also available for your reference.) If you would like someone to pre-review a sample file representing the kinds of data you would like to export from a capsule before you begin your work, please contact htrc-help@hathitrust.org. 

A release request must be under 67 MB; any request over this size will fail due to technical limitations.

Release requests should include a README text file describing the files included in the request and their data structure.

If you have a directory of results files that you would like to export to be released, you can zip the directory and export the compressed file. 
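
The packaging steps above (include a README, zip the directory, stay under 67 MB) could be scripted along the following lines. This is a minimal Python sketch under stated assumptions: the function name and paths are hypothetical, 67 MB is taken as 67,000,000 bytes, and any file whose name starts with "readme" is accepted as the README. Write the archive outside the results directory so that it does not end up inside itself.

```python
import shutil
from pathlib import Path

# Assumption: "67 MB" is interpreted as decimal megabytes.
MAX_REQUEST_BYTES = 67 * 1000 * 1000


def package_results(results_dir, archive_base):
    """Zip `results_dir` into `<archive_base>.zip` and return its path.

    Raises ValueError if no README is present at the top level of the
    directory or if the resulting archive is not under the size limit.
    """
    results_dir = Path(results_dir)
    has_readme = any(
        entry.is_file() and entry.name.lower().startswith("readme")
        for entry in results_dir.iterdir()
    )
    if not has_readme:
        raise ValueError("add a README text file describing the results")

    archive = Path(
        shutil.make_archive(str(archive_base), "zip", root_dir=results_dir)
    )
    if archive.stat().st_size >= MAX_REQUEST_BYTES:
        size = archive.stat().st_size
        archive.unlink()  # do not leave an oversized archive behind
        raise ValueError(f"archive is {size} bytes; it must be under 67 MB")
    return archive
```

For example, `package_results("results", "release_request")` would produce `release_request.zip` in the current directory, where both names are placeholders.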

You will receive an email notifying you if your results export has been approved. The link for downloading results that have been approved for export will appear on the landing page for your capsule.  Each approved request will be available for 2 weeks from the approval date. All collaborators on a capsule will get notification that approved results are ready, and will find the released results available to them on their capsule landing page.


User Quotas

There is an overall disk quota, a memory quota, and a CPU quota for each user in the Data Capsule environment. One user can consume up to 100 GB of disk space, ~20 GB of memory, and 10 CPUs. If you attempt to create a second or third Capsule that exceeds your quota in one of the areas above, then you will encounter an error. 
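
To make the quota arithmetic concrete, the Python sketch below sums the resources of a user's existing Capsules plus a new request and reports which quotas would be exceeded. It is illustrative only: the `Capsule` class and field names are hypothetical, the per-Capsule disk sizes in the example are invented, and the approximate memory quota is treated as exactly 20 GB.

```python
from dataclasses import dataclass

# Per-user quotas as stated above (memory is approximate in the source).
QUOTA = {"disk_gb": 100, "memory_gb": 20, "cpus": 10}


@dataclass
class Capsule:
    cpus: int
    memory_gb: int
    disk_gb: int


def exceeded_quotas(existing, new, quota=QUOTA):
    """Return the names of quotas that creating `new` would exceed."""
    capsules = list(existing) + [new]
    totals = {
        "disk_gb": sum(c.disk_gb for c in capsules),
        "memory_gb": sum(c.memory_gb for c in capsules),
        "cpus": sum(c.cpus for c in capsules),
    }
    return [name for name, limit in quota.items() if totals[name] > limit]


if __name__ == "__main__":
    mine = [Capsule(cpus=4, memory_gb=8, disk_gb=10),
            Capsule(cpus=4, memory_gb=8, disk_gb=10)]
    # A third Capsule of the same size would need 12 CPUs and 24 GB in total.
    print(exceeded_quotas(mine, Capsule(cpus=4, memory_gb=8, disk_gb=10)))
```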

Capsule Recall Practices

The HTRC Data Capsule service’s maximum capacity flexes depending on the size of the Capsules it hosts. In the event that the Data Capsule service cannot satisfy all simultaneous demands for Capsules:

  • A Capsule may be recalled (i.e. deleted) and the work environment will no longer exist.

    • Capsules will be identified for recall based on criteria such as date of last use and an individual’s resource usage, with the goal being to extend the number of individuals afforded the opportunity to conduct research using a Capsule

    • A researcher whose Capsule is identified for recall will be notified via email regarding the pending recall, and they will have 5 days to respond to the recall notification.

  • Priority in satisfying a new request for a Capsule will be given to researchers whose affiliated organization is a HathiTrust member.

    • At times when the Data Capsule service has reached capacity, incoming requests for Capsules will be screened based on institutional affiliation.

  • Instructors who intend to use the HTRC Data Capsules in a course should contact htrc-help@hathitrust.org so that proper arrangements can be made.

  • Users who do not abide by the Terms of Use will have their Capsule recalled immediately.