First, register an account on the HTRC portal on the production stack, from which you will access the HTRC Data Capsule.
Install a VNC client on your computer to enable communication between your computer and the capsule, which is a virtual machine (VM), to be created. You can choose any VNC client you prefer.
We use VNC Viewer for Google Chrome in this tutorial and recommend that you install the same client. Install and launch the app.
Getting Familiar with the VM
Sign in to the HTRC portal with the account you just created. Create a capsule by clicking "HTRC Data Capsule" -> "Create Virtual Machine" at the top of the page. You will be asked to provide information about the capsule you would like to create; after you submit the VM creation page, you will be assigned a capsule.
Start the capsule you were assigned by clicking the "Start VM" button on the Virtual Machines list page (you must be logged in to the portal to see this page).
After starting the capsule, you can connect to it and operate it via the VNC client you just installed. Use the capsule's "Host Name" and "VNC port" fields as input to the VNC client: enter them in the "Address" field of VNC Viewer, separated by a colon (:).
- VM Mode Switch:
Each capsule has two modes: maintenance mode and secure mode. On the "Virtual Machines" page, click the "Switch to Secure Mode" or "Switch to Maintenance Mode" button to switch between modes.
In maintenance mode, the user can access the network freely, except for the HTRC corpus repository, and can install whatever software they want. In secure mode, network access is restricted: the user can reach only a few network addresses, e.g., the HTRC corpus repository and the search service.
Run text analysis experiments in the capsule. Details of conducting experiments are demonstrated in the four use cases below. Users who want to export results out of the capsule can release them while the capsule is in secure mode.
We walk participants through four use cases of using the HTRC corpus for text analytics within the HTRC Data Capsule. For participants' convenience, the capsule you just requested has been preloaded with the required R packages and the IPython tool, along with a volume ID list for the English Short Title Catalog collection. All of these use cases are to be carried out within the capsule.
Since the work is performed within the capsule's virtual machine environment, it is helpful to open a browser in the capsule, e.g. Firefox, and go to the URL http://wiki.htrc.illinois.edu/pages/viewpage.action?pageId=22085965 (or the short link http://bit.ly/1whzT6H). Then you can easily copy and paste the hyperlinks and the commands from the Wiki.
HTRC provides a search engine API, the Solr API, that scholars can use to search for volumes of interest. Scholars can search by full text or by MARC catalog fields. An example query is
chinkapin.pti.indiana.edu:9994/solr/meta/select/?q=title:war
which returns all volumes whose titles contain "war".
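The example query can also be issued from a script. The following is a minimal sketch, not an official HTRC client: it assumes the endpoint from the example is reachable from where the script runs, and that the server accepts the standard Solr parameters `rows` and `wt=json` and returns the usual Solr JSON layout. The function names `build_query_url` and `search` are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the example query in the text. It may not be
# reachable from every network, or may have changed (assumption).
SOLR_SELECT = "http://chinkapin.pti.indiana.edu:9994/solr/meta/select/"


def build_query_url(field, term, rows=10):
    """Return a Solr select URL for `field:term`, asking for JSON output."""
    # q, rows and wt are standard Solr query parameters.
    params = {"q": f"{field}:{term}", "rows": rows, "wt": "json"}
    return SOLR_SELECT + "?" + urlencode(params)


def search(field, term, rows=10):
    """Run the query and return the list of matching documents."""
    with urlopen(build_query_url(field, term, rows), timeout=30) as resp:
        payload = json.load(resp)
    # A standard Solr JSON response keeps the hits under response -> docs
    # (assumed here; adjust if the server is configured differently).
    return payload["response"]["docs"]


if __name__ == "__main__":
    # Only builds and prints the URL; call search("title", "war") to run it.
    print(build_query_url("title", "war"))
```

Calling `search("title", "war")` would then return the matching volume records, provided the assumptions above hold.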
Given a list of volume IDs supplied by the user, the HTRC Feature API returns a term-document matrix (TDM) for those volumes. The matrix contains the term frequency counts for each volume, which can be used for further statistical analysis. In this example, we use the volume ID list of the English Short Title Catalog to request its term-document matrix from the API.
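To make the idea of a term-document matrix concrete, here is a small sketch that builds one from toy texts. It does not call the HTRC Feature API and does not reproduce its output format; the function name `term_document_matrix`, the toy volume IDs, and the whitespace tokenization are all assumptions made for illustration.

```python
from collections import Counter


def term_document_matrix(docs):
    """Build a term-document matrix from a {volume_id: text} mapping.

    Returns (terms, volume_ids, matrix), where matrix[i][j] is the
    number of times terms[i] occurs in the volume volume_ids[j].
    """
    volume_ids = sorted(docs)
    # Naive tokenization: lowercase, then split on whitespace.
    counts = {vid: Counter(docs[vid].lower().split()) for vid in volume_ids}
    # The vocabulary is the union of the terms seen in any volume.
    terms = sorted(set().union(*counts.values()))
    # A Counter returns 0 for terms that a volume does not contain.
    matrix = [[counts[vid][t] for vid in volume_ids] for t in terms]
    return terms, volume_ids, matrix


if __name__ == "__main__":
    toy = {"vol1": "war and peace", "vol2": "war war"}
    terms, vols, tdm = term_document_matrix(toy)
    for term, row in zip(terms, tdm):
        print(term, row)
```

Each row corresponds to a term and each column to a volume, which is the shape that the statistical analysis in the next use case works on.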
Using the returned term-document matrix, we run some analysis in R and visualize some insights into the collection (English Short Title Catalog).
Use the IPython interactive interface to fetch volume content, and then run a vector space model and topic modeling on the volumes' OCR content.
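As an illustration of the vector space model step, the sketch below represents each text as a term-frequency vector and compares two texts by cosine similarity. It is a simplified stand-in under stated assumptions, not the code used in the use case: the helper names `tf_vector` and `cosine_similarity` are hypothetical, tokenization is plain whitespace splitting, and no weighting such as TF-IDF is applied.

```python
import math
from collections import Counter


def tf_vector(text):
    """Represent a text as a sparse term-frequency vector."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty text is treated as similar to nothing.
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    v1 = tf_vector("war and peace")
    v2 = tf_vector("the art of war")
    print(cosine_similarity(v1, v2))
```

In IPython, the same functions can be pasted into a cell and applied to the OCR text of any two volumes once it has been fetched.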
Upon completing the hands-on session, please perform these steps to back up your results, exit the capsule, and shut down the capsule. The next time you sign in to the portal, you can restart the capsule and continue working within the same environment.