This tutorial was first developed for an Indiana University Scholar Commons event (see announcement), hosted on 9/15/2014 at IU Wells Library. It now serves as a general tutorial for hands-on sessions with the HTRC Data Capsule tool.

Short link for this page: http://bit.ly/1whzT6H

Preparation

First, register an account on the HTRC portal on the development stack, from which you will access the HTRC Data Capsule.

Sign Up




  • On the Sign Up page, enter the requested information, including the username and password you intend to use. The password must meet these requirements:
    • Password must be more than 15 characters long.
    • Password must contain characters from three of the following five categories:
      • Uppercase characters of European languages (A through Z, with diacritic marks, Greek and Cyrillic characters)
      • Lowercase characters of European languages (a through z, sharp-s, with diacritic marks, Greek and Cyrillic characters)
      • Base 10 digits (0 through 9)
      • Nonalphanumeric characters
      • Any Unicode character that is categorized as an alphabetic character but is not uppercase or lowercase. This includes Unicode characters from Asian languages.
    • Password must not contain any white spaces.
    • Password must not contain your user ID.
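
The password rules above can be checked locally before you submit the form. The following is a minimal sketch, not the portal's actual validation code: the function name `password_problems` is hypothetical, the user ID comparison is assumed to be case-insensitive, and the five categories are approximated with Python's Unicode general categories.

```python
import unicodedata
from typing import List


def password_problems(password: str, user_id: str) -> List[str]:
    """Return the list of rules the password breaks (empty if none)."""
    problems = []
    if len(password) <= 15:
        problems.append("must be more than 15 characters long")
    if any(ch.isspace() for ch in password):
        problems.append("must not contain white space")
    # Assumption: the user ID check ignores letter case.
    if user_id and user_id.lower() in password.lower():
        problems.append("must not contain the user ID")

    # Count how many of the five character categories are present.
    categories = set()
    for ch in password:
        general = unicodedata.category(ch)
        if general == "Lu":
            categories.add("uppercase")
        elif general == "Ll":
            categories.add("lowercase")
        elif ch in "0123456789":
            categories.add("digit")
        elif general.startswith("L"):
            # Alphabetic but neither uppercase nor lowercase.
            categories.add("other alphabetic")
        elif not ch.isalnum():
            categories.add("nonalphanumeric")
    if len(categories) < 3:
        problems.append("must use at least three of the five categories")
    return problems
```

For example, `password_problems("CorrectHorse7Battery", "alice")` returns an empty list, while a short or single-category password returns the rules it breaks.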

Troubleshooting: if you can't sign up

You need an academic email address to sign up. We maintain a growing list of allowed email domains, e.g. addresses ending with .edu or edu.ac or edu.tw. If you can't sign up with your academic email address, your email domain is probably not on the list. In this situation, please submit an account request by clicking the request an account button within the error message on the page.

Sign In


  • You will receive an email at the address you registered. Open your inbox and follow the activation link in the email to activate your account.
  • Now that you have created an HTRC Analytics account, you can sign in. At the top right of the page, sign in with your username and password.

 



Install a VNC client on your computer to enable communication between your computer and the Virtual Machine (VM) you are about to create. You can choose any VNC client you prefer.

We use VNC Viewer for Google Chrome in this tutorial and recommend that you install the same client. Install and launch the app.


Getting Familiar with the VM

Sign in to the development portal where you just created an account. Create a VM (virtual machine) by clicking "HTRC Data Capsule" -> "Create Virtual Machine" at the top of the page. You will be assigned a VM after submitting the VM Creation page.


Start the VM you were assigned by clicking the "Start VM" button on the Virtual Machines list page (you must be logged in to the portal to see this page).

On the Capsules page (found under Capsule on the top menu), click on the Start Capsule button.



 

After starting the VM, you can connect to it and operate it through the VNC client you just installed. Use the "Host Name" and "VNC port" fields of the VM as input to the VNC client: enter them in the "Address" field of the VNC Viewer, separated by a semicolon ;


 


Each capsule is designed to have 2 modes: maintenance mode and secure mode.
  • In maintenance mode, the user can access the network freely, except for the HTRC corpus repository, and can install whatever software they want.
  • In secure mode, network access is restricted. The user can only access a few network addresses, e.g. the HTRC corpus repository and the search service.

Any changes you make to your capsule in secure mode, such as data you download or create during your analysis, will not persist. To keep data, save it to a special storage area on your capsule called the secure volume. The secure volume is invisible in maintenance mode. Follow the further steps in this tutorial to learn how to preserve your capsule between research sessions.

To switch modes, go to the Capsules page and click the Switch to Secure Mode or Switch to Maintenance Mode button. You can practice switching modes, but you'll need your capsule in maintenance mode to follow the rest of the tutorial.
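
Because the secure volume is invisible in maintenance mode, a script running inside the capsule can use its presence as a rough hint about the current mode. This is only a sketch under assumptions: the path `/media/secure_volume` is inferred from the `cd` command used later in this tutorial, "invisible" is assumed to mean the directory is absent, and the function name `guess_mode` is hypothetical. The Capsules page remains the authoritative source for the mode.

```python
import os

# Assumption: the secure volume appears at this path only in secure mode.
SECURE_VOLUME = "/media/secure_volume"


def guess_mode(secure_volume: str = SECURE_VOLUME) -> str:
    """Guess the capsule mode from whether the secure volume is visible."""
    return "secure" if os.path.isdir(secure_volume) else "maintenance"


if __name__ == "__main__":
    print("The capsule appears to be in", guess_mode(), "mode")
```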



 



Run

Run text analysis experiments in the capsule. The 4 use cases below demonstrate how to conduct experiments.

Release Results

To export results out of the capsule, release the results produced in the capsule's secure mode.

First, we demonstrate releasing a pre-existing file, i.e. a dictionary file in the Ubuntu OS. This does not involve any text analysis or results; it only demonstrates how to release a file from secure mode.

Make sure the capsule is in secure mode.

Open a terminal in the capsule and add the following file to the release list by entering the command below

releaseresults add /usr/share/dict/american-english

Then type this command in the terminal to release the file

releaseresults done

Release Text Analysis Results

The example above releases a pre-existing file in the OS from secure mode. Here we demonstrate releasing results produced by running a text analysis algorithm. First finish running the Use Case: Run R analysis on Derived Features from the Feature API [obsolete], which generates a PDF file named "Rplots.pdf". Then release this file from secure mode as follows.

First, switch the capsule to secure mode. 

Second, open a terminal in the capsule and navigate to the secure volume by typing:

cd ../../media/secure_volume

Suppose the file you'd like to release is at /home/dcuser/demo/r/Rplots.pdf

Prepare the result file for release by adding it to the release list with the command:

releaseresults add /home/dcuser/demo/r/Rplots.pdf

Repeat this command for any other files you want to add.

Finally, to complete the release of your data, type: 

releaseresults done
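
When several files need to be released, the add and done steps above can be scripted. The sketch below is an illustration only: it uses just the two `releaseresults` subcommands shown in this tutorial, the function names are hypothetical, and it defaults to a dry run that prints the commands instead of executing them.

```python
import subprocess
from typing import Iterable, List


def build_release_commands(paths: Iterable[str]) -> List[List[str]]:
    """One 'releaseresults add' per file, followed by 'releaseresults done'."""
    commands = [["releaseresults", "add", path] for path in paths]
    commands.append(["releaseresults", "done"])
    return commands


def release(paths: Iterable[str], dry_run: bool = True) -> None:
    """Print the release commands, or run them when dry_run is False."""
    for argv in build_release_commands(paths):
        if dry_run:
            print(" ".join(argv))
        else:
            # check=True stops at the first command that fails.
            subprocess.run(argv, check=True)
```

Calling `release(["/home/dcuser/demo/r/Rplots.pdf"])` prints the same two commands typed by hand above. Pass `dry_run=False` inside the capsule, in secure mode, to actually run them.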


 

Getting Results

After you release the results in the capsule, the result files go through an HTRC internal inspection that takes up to 48 hours. If your results pass the inspection, you will receive an email from HTRC (sent to the address you used when signing up for your portal account) with a link to access them.

You can click on the link in the email and download your results.  


Files for Release Testing

In addition to the English dictionary (/usr/share/dict/american-english) shown above in the "Release Results" section, you can also try releasing these public files from Data Capsule as an exercise (i.e. pretending they are analysis results from Data Capsule). They are Wikipedia dump files of different sizes.

https://dumps.wikimedia.org/enwiki/20150602/enwiki-20150602-pages-meta-history2.xml-p000017910p000019514.7z   (154.8 MB)

https://dumps.wikimedia.org/enwiki/20150602/enwiki-20150602-pages-meta-history11.xml-p001580642p001659969.7z  (500.7 MB)

https://dumps.wikimedia.org/enwiki/20150602/enwiki-20150602-pages-meta-history27.xml-p031557579p031934316.7z (1.1 GB)

Download these data sets in maintenance mode, then switch to secure mode and release them one by one.
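
The download step can also be scripted from inside the capsule while it is in maintenance mode. The following is a rough sketch using only the Python standard library: the function names are hypothetical, and the URLs are copied from the list above and may no longer resolve.

```python
import os
import posixpath
import urllib.request
from typing import Iterable, List
from urllib.parse import urlparse

# URLs copied from the list above; they may have been retired since.
DUMP_URLS = [
    "https://dumps.wikimedia.org/enwiki/20150602/enwiki-20150602-pages-meta-history2.xml-p000017910p000019514.7z",
    "https://dumps.wikimedia.org/enwiki/20150602/enwiki-20150602-pages-meta-history11.xml-p001580642p001659969.7z",
    "https://dumps.wikimedia.org/enwiki/20150602/enwiki-20150602-pages-meta-history27.xml-p031557579p031934316.7z",
]


def local_name(url: str) -> str:
    """Use the last path component of the URL as the local file name."""
    return posixpath.basename(urlparse(url).path)


def download_all(urls: Iterable[str], dest_dir: str = ".") -> List[str]:
    """Download each URL into dest_dir and return the saved paths."""
    saved = []
    for url in urls:
        target = os.path.join(dest_dir, local_name(url))
        urllib.request.urlretrieve(url, target)
        saved.append(target)
    return saved
```

After `download_all(DUMP_URLS)` completes, switch to secure mode and release the saved files with `releaseresults add` and `releaseresults done`.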


Use Cases

We walk through 4 use cases that use the HTRC corpus for text analytics within the HTRC Data Capsule VM. For the convenience of demo participants, the VM you just requested has been pre-loaded with the required R packages and the IPython tool, along with a volume ID list of the English Short Title Catalog. All of these use cases are to be run within the VM.

Because the use cases run inside the VM, it helps to open a browser in the VM, e.g. Firefox, and go to http://wiki.htrc.illinois.edu/pages/viewpage.action?pageId=22085965 or http://bit.ly/1whzT6H so that you can easily copy and paste the hyperlinks and commands from the Wiki.

 

HTRC provides a search engine API, the Solr API, that lets scholars search for volumes of interest. Scholars can search by full text or by MARC catalog fields. An example query is http://chinkapin.pti.indiana.edu:9994/solr/meta/select/?q=title:war which returns all volumes whose titles contain "war".
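
Query URLs of the form shown above can be assembled programmatically. This is a minimal sketch that only builds the URL string: the endpoint is copied from the example query, the function name `build_solr_query` is hypothetical, and no claim is made that the host is still reachable.

```python
from urllib.parse import urlencode

# Endpoint copied from the example query in this tutorial.
SOLR_SELECT = "http://chinkapin.pti.indiana.edu:9994/solr/meta/select/"


def build_solr_query(field: str, term: str, base: str = SOLR_SELECT) -> str:
    """Build a 'field:term' query URL such as ...?q=title:war."""
    # safe=":" keeps the field separator readable instead of encoding it as %3A.
    return base + "?" + urlencode({"q": f"{field}:{term}"}, safe=":")


if __name__ == "__main__":
    print(build_solr_query("title", "war"))
```

The resulting URL can be pasted into the browser inside the VM or fetched with `urllib.request.urlopen`.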


Given a list of volume IDs supplied by the user, the HTRC Feature API returns a Term-Document-Matrix (TDM) for the volumes. The matrix contains the term frequency counts for each volume, which can be used for further statistical analysis. In this example, we use the English Short Title Catalog's volume ID list to request its Term-Document-Matrix from the API.
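
To make the shape of a term-document matrix concrete, here is a small sketch that builds one from toy data. It does not call the Feature API: the volume IDs and tokens are invented for illustration, and the function name `term_document_matrix` is hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def term_document_matrix(
    docs: Dict[str, List[str]]
) -> Tuple[List[str], List[str], List[List[int]]]:
    """Return (terms, volume_ids, matrix) with one row per term
    and one column per volume; each cell is a term frequency count."""
    counts = {vid: Counter(tokens) for vid, tokens in docs.items()}
    terms = sorted(set().union(*counts.values()))
    volume_ids = list(docs)
    matrix = [[counts[vid][term] for vid in volume_ids] for term in terms]
    return terms, volume_ids, matrix


if __name__ == "__main__":
    # Hypothetical volume IDs and tokens, for illustration only.
    toy = {
        "vol-a": ["war", "and", "peace", "war"],
        "vol-b": ["peace", "treaty"],
    }
    terms, volume_ids, matrix = term_document_matrix(toy)
    for term, row in zip(terms, matrix):
        print(term, row)
```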


Using the returned Term-Document-Matrix, we run some R analysis and visualize some insights about the collection (English Short Title Catalog).


Use the IPython interactive interface to fetch volume content, and then run a vector space model and topic modeling on the volumes' OCR content.
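
As a rough illustration of the vector space model idea, each volume can be represented as a vector of term counts and compared with cosine similarity. The sketch below uses toy tokens and only the standard library. It is not the code used in the use case, and the function name `cosine_similarity` is hypothetical.

```python
import math
from collections import Counter
from typing import List


def cosine_similarity(tokens_a: List[str], tokens_b: List[str]) -> float:
    """Cosine of the angle between two term-count vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    # Only terms present in both documents contribute to the dot product.
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Two token lists with identical term counts score close to 1.0, and token lists with no terms in common score 0.0.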



 

Finishing Up

Upon completing the hands-on session, please perform these steps to back up your results, exit the VM, and shut down the VM. The next time you sign in to the portal, you can restart the VM and continue working within the same environment.

Resources

See the Data Capsule User's Guide for more details about interacting with the HTRC Data Capsule VM.