HTRC Data Capsules are secure computing environments developed to facilitate non-consumptive text analysis research. Each Capsule is a virtual machine (VM) that provides researchers with a desktop they can use to investigate volumes in the HathiTrust Digital Library.
Configuration options for Research Capsules:
Research Data Capsules can be shared between up to 5 collaborators. The person who creates the Capsule has the most control over it, and they can add and remove other collaborators, assign permissions, and delete the Capsule.
There are 3 roles for users of a shared Capsule:
Once a collaborator is added to a Capsule, the Capsule will appear for them on their Capsules listing page in HTRC Analytics. Before the new collaborator can access the Capsule, they will need to agree to the Data Capsules Terms of Use.
For Capsules with full-corpus access, HTRC will review the request to add a collaborator and either approve or deny it. The Capsule details will only appear on their Capsules listing page if the request is approved.
As they are meant for short-term exploration, demo capsules cannot be shared with collaborators.
Each Capsule comes pre-loaded with the following software, libraries, and data. For more details about installed packages, consult the ReadMe file on the desktop of your Capsule.
Name | Version | URL | Note |
---|---|---|---|
Akka | 2.4.14 | http://akka.io/ | |
Anaconda 3 | 4.2.0 | https://www.continuum.io/anaconda-overview | Supports both Python 2.X and 3.X. See list below for the Python libraries pre-installed (some via Anaconda) |
Ant | 1.9.7 | http://ant.apache.org/ | |
Hadoop | 2.7.3 | http://hadoop.apache.org/ | |
InPho Topic Explorer | | https://inpho.github.io/topic-explorer/ | Project website: https://www.hypershelf.org/ |
Mallet | 2.0.8 | http://mallet.cs.umass.edu/ | |
R | 3.3.9 | https://www.r-project.org/ | |
Sbt | 0.13.13 | http://www.scala-sbt.org/ | |
Scala | 2.12.1 | https://www.scala-lang.org/ | |
Spark | 2.0.2 | http://spark.apache.org/ | |
Voyant Tools | | https://voyant-tools.org/ | |
HTRC-developed
General
There is an overall disk quota, a memory quota, and a CPU quota for each user in the Data Capsule environment. One user can consume up to 100 GB of disk space, ~20 GB of memory, and 10 CPUs. If you attempt to create a second or third Capsule that exceeds your quota in one of the areas above, then you will encounter an error.
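As a rough illustration of how the per-user quotas combine across Capsules, the following Python sketch adds up the resources of existing Capsules plus a requested one and reports which limits would be exceeded. The `CapsuleSize` class and `exceeded_quotas` function are hypothetical names introduced for this example, not part of any HTRC tool, and the memory limit is treated as exactly 20 GB even though the figure above is approximate.

```python
from dataclasses import dataclass

# Per-user limits as stated above; memory is given only approximately (~20 GB).
DISK_QUOTA_GB = 100
MEMORY_QUOTA_GB = 20
CPU_QUOTA = 10


@dataclass(frozen=True)
class CapsuleSize:
    """Hypothetical description of one Capsule's resource allocation."""
    disk_gb: int
    memory_gb: int
    cpus: int


def exceeded_quotas(existing, requested):
    """Return the names of the quotas the requested Capsule would exceed."""
    capsules = list(existing) + [requested]
    totals = {
        "disk": sum(c.disk_gb for c in capsules),
        "memory": sum(c.memory_gb for c in capsules),
        "cpu": sum(c.cpus for c in capsules),
    }
    limits = {"disk": DISK_QUOTA_GB, "memory": MEMORY_QUOTA_GB, "cpu": CPU_QUOTA}
    return [name for name, total in totals.items() if total > limits[name]]


if __name__ == "__main__":
    current = [CapsuleSize(disk_gb=50, memory_gb=8, cpus=4)]
    wanted = CapsuleSize(disk_gb=60, memory_gb=8, cpus=4)
    print(exceeded_quotas(current, wanted))  # prints ['disk']
```

An empty result means the requested Capsule fits within all three limits under these assumptions; the service itself remains the authority on whether a request succeeds.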
The HTRC Data Capsule service’s maximum capacity flexes depending on the size of the Capsules it hosts. In the event that the Data Capsule service cannot satisfy all simultaneous demands for Capsules:
A Capsule may be recalled (i.e. deleted) and the work environment will no longer exist.
Capsules will be identified for recall based on criteria such as date of last use and an individual’s resource usage, with the goal of increasing the number of individuals who have the opportunity to conduct research using a Capsule.
A researcher whose Capsule is identified for recall will be notified via email regarding the pending recall, and they will have 5 days to respond to the recall notification.
Priority in satisfying a new request for a Capsule will be given to researchers whose affiliated organization is a HathiTrust member.
At times when the Data Capsule service has reached capacity, incoming requests for Capsules will be screened based on institutional affiliation.
Instructors who intend to use the HTRC Data Capsules in a course should contact htrc-help@hathitrust.org so that proper arrangements can be made.
Users who do not abide by the Terms of Use will have their Capsule recalled immediately.
Use the HTRC site to handle administrative tasks for your Capsule:
Create - a Capsule is created, but it is not yet running
Start - turn the Capsule on in maintenance mode
Stop - shut down a Capsule
Delete - the Capsule is deleted (including its data and settings)
Switch modes - change the Capsule from maintenance to secure mode, or vice-versa (see below)
See status - view your Capsules and their statuses
Interact - use your Capsule either through a desktop view or a terminal (command line) view
The Capsules are configured with special security settings that allow you to interact with them in two modes: maintenance mode and secure mode.
Access your Capsule in-browser from HTRC Analytics either by viewing the Remote Desktop (both modes available) or the Terminal command line interface (Maintenance Mode only). Earlier versions of the Capsule environment required a VNC viewer and passwords for both the VNC and the Capsule's operating system; those requirements were removed in the web-based version implemented in August 2018.
You can also SSH into your Capsule, in Maintenance Mode only, if you've followed the directions under "Advanced Features" to set up a public key.
To operate your Capsule, click on the Capsule ID from the Capsule list page. Then choose to either view the remote desktop or the terminal. The terminal will work in Maintenance Mode only.
If you've established a key for SSH access, you can also SSH into your Capsule when it's in Maintenance Mode by using the command viewable under "Advanced Features" on an individual Capsule's status page.
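The access rules described above can be summarized as a small lookup table. The Python sketch below is only a restatement of this text for quick reference; the names `Mode`, `ACCESS_MODES`, and `available_access` are hypothetical and do not correspond to any HTRC API.

```python
from enum import Enum


class Mode(Enum):
    MAINTENANCE = "maintenance"
    SECURE = "secure"


# Access methods and the modes in which the text above says they work.
ACCESS_MODES = {
    "remote_desktop": {Mode.MAINTENANCE, Mode.SECURE},
    "terminal": {Mode.MAINTENANCE},
    "ssh": {Mode.MAINTENANCE},  # additionally requires a public key to be set up
}


def available_access(mode):
    """Return the access methods usable in the given mode, sorted by name."""
    return sorted(name for name, modes in ACCESS_MODES.items() if mode in modes)


if __name__ == "__main__":
    print(available_access(Mode.SECURE))       # prints ['remote_desktop']
    print(available_access(Mode.MAINTENANCE))  # prints ['remote_desktop', 'ssh', 'terminal']
```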
Use the HTRC Workset Toolkit to import HathiTrust text data into your Capsule. Any outside data you plan to analyze in conjunction with HathiTrust data can be added to your Capsule from a web-accessible location when your machine is in Maintenance Mode.
Earlier versions of the Capsule environment required passwords for both the VNC and the Capsule's operating system; those requirements were removed in the web-based version implemented in August 2018. If you use the HTRC Workset Toolkit included in your Capsule to import data, you will be prompted for your HTRC Analytics username and password.
Create and start a Capsule in the HTRC
View your Capsule using the Remote Desktop view or Terminal view.
Configure the software environment of the Capsule as needed, and download the scripts or programs you plan to use in your analysis
Switch Capsule to secure mode through HTRC
Run your analysis against the secure HTRC corpus repository
Move your results to the secure volume storage on the Capsule
Switch Capsule back to maintenance mode to regain normal network access
Data and tools can easily enter a user's Capsule, but anything leaving a Capsule must undergo review prior to release to the user. The guidelines used during review of the outputs of a Capsule are as follows:
The general rule of thumb is whether the export would create a substitute for a human reading the original text. (The full Non-Consumptive Use Research Policy is also available for your reference.) If you would like someone to pre-review a sample file that represents the kinds of data you would like to export from a Capsule before you begin your work, please contact htrc-help@hathitrust.org.
A release request must be under 67 MB, and any submitted requests over this size will fail due to technical limitations.
Release requests should include a README text file describing the files included in the request and their data structure.
If you have a directory of results files that you would like to export to be released, you can zip the directory and export the compressed file.
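The packaging guidelines above (include a README, zip the directory, stay under 67 MB) could be checked before submitting a request with a short Python script like the sketch below. It is not an HTRC-provided tool: `package_results` is a hypothetical helper, the README check assumes the file name starts with "readme" in any letter case, and "67 MB" is interpreted as 67 × 10^6 bytes, the stricter of the two common readings.

```python
import shutil
import tempfile
from pathlib import Path

# Assumption: "67 MB" is read as 67 * 10**6 bytes, the stricter interpretation.
MAX_RELEASE_BYTES = 67 * 10**6


def package_results(results_dir, archive_base):
    """Zip results_dir and return the archive path, or raise if a rule is broken.

    archive_base is the output path without the ".zip" suffix; keep it outside
    results_dir so the archive does not end up inside itself.
    """
    results_dir = Path(results_dir)
    # The guidelines ask for a README text file describing the exported files.
    if not any(p.name.lower().startswith("readme") for p in results_dir.iterdir()):
        raise FileNotFoundError(f"no README file found in {results_dir}")
    archive = Path(shutil.make_archive(str(archive_base), "zip", root_dir=results_dir))
    size = archive.stat().st_size
    if size >= MAX_RELEASE_BYTES:
        archive.unlink()  # discard an archive that could not be submitted anyway
        raise ValueError(f"archive is {size} bytes; must be under {MAX_RELEASE_BYTES}")
    return archive


if __name__ == "__main__":
    # Demonstration with throwaway files; real results would live on the Capsule.
    with tempfile.TemporaryDirectory() as tmp:
        results = Path(tmp) / "results"
        results.mkdir()
        (results / "README.txt").write_text("counts.csv: word counts per volume\n")
        (results / "counts.csv").write_text("word,count\nwhale,12\n")
        print(package_results(results, Path(tmp) / "export").name)  # prints export.zip
```

Passing this check says nothing about whether the contents satisfy the non-consumptive use review; it only catches the size and README requirements early.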
You will receive an email notifying you if your results export has been approved. The link for downloading results that have been approved for export will appear on the landing page for your capsule. Each approved request will be available for 2 weeks from the approval date. All collaborators on a capsule will get notification that approved results are ready, and will find the released results available to them on their capsule landing page.
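To keep track of how long an approved export remains downloadable, the two-week window can be computed from the approval date. The Python sketch below assumes the window is exactly 14 calendar days; whether the final day is inclusive is not stated above, and `download_deadline` is a hypothetical helper name.

```python
from datetime import date, timedelta

# Approved results are available for 2 weeks from the approval date.
AVAILABILITY = timedelta(weeks=2)


def download_deadline(approval_date):
    """Return the date 14 days after approval (assumed end of availability)."""
    return approval_date + AVAILABILITY


if __name__ == "__main__":
    print(download_deadline(date(2024, 3, 1)))  # prints 2024-03-15
```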