This tutorial was first developed for an Indiana University Scholar Commons event (see announcement), held on 9/15/2014 at the IU Wells Library. It is now used as a general tutorial for hands-on sessions with the HTRC Data Capsule tool.

Short link for this page: http://bit.ly/1whzT6H

Preparation

First, register an account on the HTRC portal on the development stack, from which you will access the HTRC Data Capsule.

Install a VNC client on your computer to enable communication between your computer and the Virtual Machine (VM) that you will create. You can use any VNC client you prefer.

This tutorial uses VNC Viewer for Google Chrome, and we recommend that participants install the same client. Install and launch the app.

Getting Familiar with the VM

Sign in to the development portal where you just created an account. Create a VM (virtual machine) by clicking "HTRC Data Capsule" -> "Create Virtual Machine" at the top of the page. You will be assigned a VM after submitting the VM creation page.

Start the VM you were assigned by clicking the "Start VM" button on the Virtual Machines list page (you must be logged in to the portal to see this page).

After starting the VM, you can connect to it and operate it through the VNC client you just installed. Use the VM's "Host Name" and "VNC port" fields as input to the VNC client: enter them in the "Address" field of the VNC Viewer, separated by a colon (:).

The VM has two modes: maintenance mode and secure mode. On the "Virtual Machines" page, click the "Switch to Secure Mode" or "Switch to Maintenance Mode" button to switch between them.

In maintenance mode, the user can access the network freely, except for the HTRC corpus repository, and can install whatever software she wants. In secure mode, network access is restricted: the user can reach only a few network addresses, e.g., the HTRC corpus repository and search service.
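
The difference between the two modes can be read as a small access rule. The following Python sketch is only an illustration of that rule, not how the Data Capsule enforces it; the host names and the `is_access_allowed` helper are hypothetical.

```python
from enum import Enum


class Mode(Enum):
    MAINTENANCE = "maintenance"
    SECURE = "secure"


# Hypothetical host names, used only for illustration.
CORPUS_REPOSITORY = "corpus.example.org"
SECURE_ALLOW_LIST = {CORPUS_REPOSITORY, "search.example.org"}


def is_access_allowed(mode: Mode, host: str) -> bool:
    """Return True if the VM may reach `host` in the given mode."""
    if mode is Mode.MAINTENANCE:
        # Free network access, except for the corpus repository.
        return host != CORPUS_REPOSITORY
    # Secure mode: only the short allow-list is reachable.
    return host in SECURE_ALLOW_LIST


print(is_access_allowed(Mode.MAINTENANCE, CORPUS_REPOSITORY))
print(is_access_allowed(Mode.SECURE, CORPUS_REPOSITORY))
```

In practice this means software installation belongs in maintenance mode, and any work that reads the corpus belongs in secure mode.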


Run text analysis experiments in the VM. The 4 use cases below demonstrate how to conduct experiments. To export results from the VM, release them while the VM is in secure mode.

Use Cases

We walk through 4 use cases that use the HTRC corpus for text analytics within the HTRC Data Capsule VM. For the convenience of demo participants, the VM you just requested has been pre-loaded with the required R packages and the IPython tool, along with a volume ID list of the English Short Title collection. All of these use cases are to be run within the VM.

Because the use cases run in the VM, it helps to open a browser in the VM, e.g. Firefox, and go to http://wiki.htrc.illinois.edu/pages/viewpage.action?pageId=22085965 or http://bit.ly/1whzT6H. You can then easily copy and paste the hyperlinks and commands from the wiki.

 

HTRC provides a search engine API, the Solr API, that scholars can use to search for volumes of interest. Scholars can search by full text or by MARC catalog fields. An example query is http://chinkapin.pti.indiana.edu:9994/solr/meta/select/?q=title:war, which returns all volumes whose titles contain "war".
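
As a minimal sketch of how such a query can be assembled in Python, the snippet below builds the same kind of URL from a field name and a search term. The base URL comes from the example above; the `build_solr_query` helper is hypothetical, and `rows` and `wt` are standard Solr request parameters that are assumed here to be accepted by this endpoint. The snippet only builds the URL and does not send a request.

```python
from urllib.parse import urlencode

# Base URL taken from the example query in this tutorial.
SOLR_SELECT_URL = "http://chinkapin.pti.indiana.edu:9994/solr/meta/select/"


def build_solr_query(field: str, term: str, rows: int = 10) -> str:
    """Build a Solr select URL that searches `field` for `term`."""
    params = {
        "q": f"{field}:{term}",  # e.g. title:war
        "rows": rows,            # assumed: maximum number of results to return
        "wt": "json",            # assumed: ask Solr for a JSON response
    }
    # urlencode percent-encodes the colon in the query as %3A.
    return SOLR_SELECT_URL + "?" + urlencode(params)


url = build_solr_query("title", "war")
print(url)
```

The resulting URL can be pasted into the browser inside the VM, or fetched with any HTTP client available there.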

Given a user-supplied list of volume IDs, the HTRC Feature API returns a Term-Document-Matrix (TDM) for the volumes. The matrix contains the term frequency counts for each volume, which can be used for further statistical analysis. In this example, we use the English Short Title Catalog's volume ID list to request its Term-Document-Matrix from the API.
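
To make the shape of a Term-Document-Matrix concrete, here is a small Python sketch that builds one from two toy texts. It does not call the HTRC Feature API and makes no claim about the API's actual response format; the volume IDs, texts, and the `build_term_document_matrix` helper are made up for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple


def build_term_document_matrix(
    volumes: Dict[str, str],
) -> Tuple[List[str], Dict[str, List[int]]]:
    """Return (volume_ids, matrix); matrix[term][i] is the count of term in volume i."""
    volume_ids = list(volumes)
    # Count whitespace-separated, lower-cased tokens per volume.
    counts = {vid: Counter(volumes[vid].lower().split()) for vid in volume_ids}
    vocabulary = sorted(set().union(*counts.values()))
    matrix = {term: [counts[vid][term] for vid in volume_ids] for term in vocabulary}
    return volume_ids, matrix


# Hypothetical volume IDs and contents.
sample_volumes = {"vol.001": "war and peace", "vol.002": "the art of war"}
volume_ids, tdm = build_term_document_matrix(sample_volumes)
for term, row in tdm.items():
    print(term, row)
```

Each row is a term and each column is a volume, so the row for "war" holds a count of 1 for both toy volumes.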

Using the returned Term-Document-Matrix, we run some analysis in R and visualize some insights into the collection (English Short Title Catalog).
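
The use case itself is carried out in R. As a rough idea of the kind of summary such an analysis might start from, the Python sketch below ranks terms by their total frequency across volumes. The matrix values and the `top_terms` helper are hypothetical, and this is not the analysis performed in the use case.

```python
from typing import Dict, List, Tuple


def top_terms(matrix: Dict[str, List[int]], k: int = 3) -> List[Tuple[str, int]]:
    """Return the k terms with the highest total count across all volumes."""
    totals = {term: sum(row) for term, row in matrix.items()}
    # Sort by descending total; break ties alphabetically for a stable result.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))[:k]


# Hypothetical term-document matrix: one row per term, one column per volume.
sample_matrix = {
    "war": [3, 1, 0],
    "peace": [1, 0, 0],
    "history": [0, 2, 2],
    "the": [5, 4, 6],
}
ranking = top_terms(sample_matrix)
print(ranking)
```

A very frequent function word such as "the" tops the ranking here, which is one reason stop-word removal is a common step before plotting term frequencies.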

Use the IPython interactive interface to fetch volume content, and then run a vector space model and topic modeling on the volumes' OCR content.
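
For readers unfamiliar with the vector space model, the following Python sketch shows the basic idea using only the standard library: each text becomes a vector of TF-IDF weights, and two texts are compared by cosine similarity. The sample texts and helper names are made up, the tokenizer is a plain whitespace split, and the use case may rely on different tools. Topic modeling is not covered by this sketch.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf_vectors(documents: List[str]) -> List[Dict[str, float]]:
    """Turn each document into a sparse vector of TF-IDF weights."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return vectors


def cosine_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine of the angle between two sparse vectors (0.0 if either is all zeros)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up texts standing in for OCR content.
texts = ["war and peace", "the art of war", "peace and quiet"]
vectors = tf_idf_vectors(texts)
print(cosine_similarity(vectors[0], vectors[1]))
print(cosine_similarity(vectors[0], vectors[2]))
```

The first text shares two terms with the third and only one with the second, so its similarity to the third comes out higher.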

Finishing Up

Upon completion of the hands-on session, please perform these steps to back up your results, exit the VM, and shut down the VM. The next time you sign in to the portal, you can restart the VM and continue working within the same environment.

Resources

See the Data Capsule User's Guide for more details about interacting with the HTRC Data Capsule VM.