Child pages
  • Use Case: Perform Text Analytics Using Topic Explorer


Use the IPython interactive interface to fetch volume content, and then run vector space modeling and topic modeling on the volumes' OCR content. This use case relies on inpho/vsm, a Python package for textual semantics developed by Dr. Colin Allen and his team at IU.

This use case obtains some HTRC volume content, builds topic models based on the content, and then visualizes the topic models in a web browser.

VM Mode

This use case can be run only in secure mode in the VM. To export experiment results from the VM, release the result files in secure mode; you will then receive the results via email.

Example Use

First, switch the VM to secure mode (done in the HTRC portal).

Second, edit the username and password in ~/demo/vsm/DownloadVolumes.py to match your portal username and password. This is needed for the HTRC Data API client.

In the VM, start a Terminal and change directory to the vsm experiment folder

Code Block
languagebash
cd ~/demo/vsm
ls  #list the files

Start an IPython notebook server 

Then change directory to the Topic Explorer demo folder

Code Block
cd /home/dcuser/HTRC-Demos/Python/topicexplorer-demo

List the files in this folder

Code Block
languagebash
ls
 

The following files are related to this analysis.

  • htrc-demo.sh - This is the script for the topic modeling analysis.
  • htrc-id - This file contains the list of volume ids.
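As a rough illustration of how a script might consume the htrc-id file, the sketch below reads volume ids from a text file. It assumes one id per line and skips blank lines; the actual file format is not documented on this page, and the function name read_volume_ids and the sample ids are hypothetical.

```python
import os
import tempfile


def read_volume_ids(path):
    """Return the non-empty lines of a file, assumed to hold one volume id per line."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


# Demonstrate with a temporary stand-in for htrc-id (the ids below are made up).
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "htrc-id")
    with open(sample_path, "w", encoding="utf-8") as handle:
        handle.write("example.vol0001\n\nexample.vol0002\nexample.vol0003\n")
    volume_ids = read_volume_ids(sample_path)

print(len(volume_ids), "volume ids:", volume_ids)
```

In the VM, the same function could be pointed at the real htrc-id file in the demo folder, provided the one-id-per-line assumption holds.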

Run the topic modeling analysis

Note

Before running the topic modeling analysis, check that the 'secure_volume' path in the script is correct. The correct path is '/media/secure_volume'.
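One way to make this check less error-prone is to scan the script for lines that mention secure_volume without the expected path. The sketch below is only an illustration: the helper name find_bad_secure_volume_lines is hypothetical, and the sample script text is invented, since the real contents of htrc-demo.sh are not shown here.

```python
EXPECTED_PATH = "/media/secure_volume"


def find_bad_secure_volume_lines(script_text, expected=EXPECTED_PATH):
    """Return (line_number, line) pairs that mention secure_volume without the expected path."""
    bad = []
    for number, line in enumerate(script_text.splitlines(), start=1):
        if "secure_volume" in line and expected not in line:
            bad.append((number, line.strip()))
    return bad


# Hypothetical script content; the real htrc-demo.sh will differ.
sample_script = """#!/bin/bash
OUT_DIR=/media/secure_volume/results
TMP_DIR=/mnt/secure_volume/tmp
"""

problems = find_bad_secure_volume_lines(sample_script)
for number, line in problems:
    print(f"line {number}: {line}")
```

To check the real script, read it with open("htrc-demo.sh").read() and pass the text to the same function; an empty result means every secure_volume reference uses the expected path.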


Code Block
languagebash
./htrc-demo.sh

A browser window pops up. In it, click on HTRC_vsm_corpus.ipynb.

In the HTRC_vsm_corpus.ipynb notebook, run all the cells by clicking "Cell -> Run All" in the menu at the top of the page.

While the script runs, you will see output in the console. This means the program is building topic models on the volume content.

Topic modeling is computationally intensive, so it takes quite a while to finish. When the process is done, a browser opens automatically so you can view the result. Click on the "Topic" button.


The scripts will run into errors if the VM is in maintenance mode.

The demo code in HTRC_vsm_corpus.ipynb takes one HTRC volume and:

  • cleans up the content by handling page headers, line breaks, and hyphens
  • builds a Corpus object, excluding words whose frequency is less than 3
  • saves the Corpus object for future use
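The steps above can be sketched in plain Python. This is not the notebook's code and does not use the inpho/vsm API: it assumes each page starts with a single header line, treats a hyphen at the end of a line as a word break to rejoin, and represents the corpus as a plain list of tokens rather than a vsm Corpus object. The sample pages are made up, and saving is shown as an in-memory pickle round trip.

```python
import pickle
import re
from collections import Counter


def clean_page(page_text, header_lines=1):
    """Drop the first header_lines lines, rejoin hyphenated words, and flatten line breaks."""
    body = "\n".join(page_text.splitlines()[header_lines:])
    body = re.sub(r"(\w)-\n(\w)", r"\1\2", body)  # "o-\ncean" becomes "ocean"
    return re.sub(r"\s+", " ", body).strip()


def build_corpus(pages, min_freq=3):
    """Tokenize cleaned pages and keep only words that occur at least min_freq times."""
    tokens = []
    for page in pages:
        tokens.extend(re.findall(r"[a-z]+", clean_page(page).lower()))
    counts = Counter(tokens)
    return [word for word in tokens if counts[word] >= min_freq]


# Made-up pages standing in for one volume's OCR text.
pages = [
    "PAGE 1 HEADER\nthe whale and the sea\nthe whale swims in the o-\ncean",
    "PAGE 2 HEADER\nthe sea is deep and the whale is large",
]
corpus = build_corpus(pages)
restored = pickle.loads(pickle.dumps(corpus))  # save-and-reload round trip
print(Counter(corpus))
```

With real OCR text the header rule usually needs tuning per volume, so header_lines is left as a parameter rather than hard-coded.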

Next, open another IPython notebook, HTRC_vsm_model.ipynb (the list of IPython notebooks can be found at 127.0.0.1:8888/tree in the VM).

Run all the demo code there by clicking "Cell -> Run All".

These errors occur in maintenance mode because this use case fetches HTRC content through the Data API, which is accessible only in secure mode.

This demo code:

  • reads in a saved Corpus object
  • loads data from 3 volumes in HathiTrust using the HTRC Data API
  • builds an LDA topic model from the corpus
  • saves the trained LDA model
  • displays the topics
  • displays topics that relate to a list of words
  • displays documents that are most likely generated by a specific topic
  • clusters topics based on the LDA result
  • visualizes the clustered topics in 2-D, interactively, in a web browser
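To make the model-building steps more concrete, here is a minimal LDA sketch using collapsed Gibbs sampling in pure Python. It is a generic textbook-style sampler, not the inpho/vsm implementation used by the demo; the function names, the hyperparameters (alpha, beta, n_iter), and the toy documents are all assumptions made for illustration. It covers only training, viewing top words per topic, and ranking documents by topic; clustering and browser visualization are out of scope.

```python
import random
from collections import Counter


def train_lda(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; docs is a list of token lists."""
    rng = random.Random(seed)
    n_words = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]          # tokens in doc d assigned to topic k
    topic_word = [Counter() for _ in range(n_topics)]   # occurrences of word w in topic k
    topic_total = [0] * n_topics
    assign = []
    for d, doc in enumerate(docs):                      # random initial assignment
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assign[d][i]                        # remove the token's current assignment
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha) * (topic_word[t][w] + beta)
                    / (topic_total[t] + beta * n_words)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k                        # record the resampled topic
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word


def top_words(topic_word, topic, n=3):
    """Return up to n of the most frequent words currently assigned to a topic."""
    return [w for w, c in topic_word[topic].most_common(n) if c > 0]


def docs_for_topic(doc_topic, topic):
    """Rank document indices by the share of their tokens assigned to the topic."""
    shares = [(row[topic] / sum(row), d) for d, row in enumerate(doc_topic)]
    return [d for share, d in sorted(shares, reverse=True)]


# Toy documents (made up); a real run would use the tokens of the saved corpus.
docs = [
    "whale sea ship whale sea".split(),
    "ship sea whale ship".split(),
    "court law judge law".split(),
    "judge court law court".split(),
]
doc_topic, topic_word = train_lda(docs, n_topics=2)
for k in range(2):
    print("topic", k, top_words(topic_word, k), "docs:", docs_for_topic(doc_topic, k))
```

Fixing the seed makes a run repeatable, which helps when comparing topic lists across experiments; on a corpus of real volumes, a dedicated library will be far faster than this pure-Python loop.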

Here are the scripts used in this example: topicexplorer-demo.zip