
This use case uses the IPython interactive interface to fetch volume content and then run vector space modeling and topic modeling on the volumes' OCR content. It relies on the inpho/vsm Python package, a textual semantics package developed by Dr. Colin Allen and his team at IU.

VM Mode

This use case can be run only in the VM's secure mode. To export experiment results from the VM, release the result files while in secure mode; the results are then sent to you by email.

Example Use

First, switch the VM to secure mode.

Second, open ~/demo/vsm/DownloadVolumes.py and replace the username and password in it with your portal username and password. See the screenshot below. The HTRC Data API client needs these credentials.

In the VM, start a Terminal and change directory to the vsm experiment folder

cd ~/demo/vsm

List the files in this folder

ls

Start an IPython notebook server 

./demo.sh

A browser window will open showing the list of notebooks. Click on HTRC_vsm_corpus.ipynb.

In the HTRC_vsm_corpus.ipynb notebook, run all the scripts by clicking "Cell" -> "Run All" in the menu at the top of the page.

The scripts will fail with errors if the VM is in maintenance mode.

The demo code in HTRC_vsm_corpus.ipynb takes one HTRC volume and:

  • cleans up the content by handling page headers, line breaks, and hyphens
  • builds a Corpus object, excluding words that occur fewer than 3 times
  • saves the Corpus object for later use
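The cleanup and filtering steps listed above can be illustrated with a minimal standard-library sketch. It is not the notebook's code and does not use the vsm API: the function names `clean_pages` and `build_vocabulary`, the sample pages, and the assumption that the first line of every page is a running header are all hypothetical, chosen only to make the idea concrete.

```python
import pickle
import re
from collections import Counter


def clean_pages(pages):
    """Join OCR pages into one string of running text.

    Assumption (hypothetical): the first line of each page is a page header.
    """
    body_lines = []
    for page in pages:
        lines = page.split("\n")
        body_lines.extend(lines[1:])  # drop the assumed header line
    text = "\n".join(body_lines)
    # Rejoin words hyphenated across a line break: "ani-\nmal" -> "animal".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Treat the remaining line breaks as ordinary spaces.
    return re.sub(r"\s+", " ", text).strip()


def build_vocabulary(text, min_count=3):
    """Tokenize the text and keep only words occurring at least min_count times."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    kept = [t for t in tokens if counts[t] >= min_count]
    return kept, counts


# Made-up sample pages, for illustration only.
pages = [
    "CHAPTER ONE 1\nThe whale is a large ani-\nmal. The whale swims.",
    "CHAPTER ONE 2\nThe sea is deep. The whale dives in the sea.",
]
text = clean_pages(pages)
tokens, counts = build_vocabulary(text, min_count=3)

# Round-trip through pickle as a stand-in for saving the corpus;
# the format the notebook actually uses is not shown here.
restored = pickle.loads(pickle.dumps(tokens))
```

In this toy input only "the" and "whale" occur three or more times, so every other word is dropped before modeling. Real OCR headers vary from page to page, so the one-line header rule would likely need to be replaced with something more careful.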

Next, open another IPython notebook, HTRC_vsm_model.ipynb. (The list of IPython notebooks can be found at 127.0.0.1:8888/tree in the VM.)

Run all the demo code there by clicking "Cell" -> "Run All".

This demo code:

  • reads in a saved Corpus object
  • builds an LDA model from the corpus
  • saves the trained LDA model
  • shows the topics
  • displays topics that relate to a list of words
  • displays documents that are most likely generated by a specific topic
  • clusters topics based on the LDA result
  • visualizes the clustered topics in 2-D
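To make the topic, word, and document views above more concrete, here is a small sketch of what such queries compute once an LDA model has been trained. It does not use the vsm API, and the numbers are invented for illustration: `topic_word`, `doc_topic`, and the three helper functions are hypothetical names, and ranking topics by summed word probability is only one simple choice that the notebook's own viewer may not share.

```python
# Hypothetical toy model: 3 topics over a 4-word vocabulary (made-up numbers).
vocab = ["whale", "sea", "ship", "law"]

# topic_word[k][w]: probability of word w under topic k (each row sums to 1).
topic_word = [
    [0.6, 0.3, 0.1, 0.0],
    [0.1, 0.2, 0.6, 0.1],
    [0.0, 0.1, 0.1, 0.8],
]

# doc_topic[d][k]: proportion of topic k in document d (each row sums to 1).
doc_topic = {
    "page_1": [0.7, 0.2, 0.1],
    "page_2": [0.1, 0.8, 0.1],
    "page_3": [0.2, 0.1, 0.7],
}


def top_words(topic, n=2):
    """Return the n most probable words of a topic."""
    order = sorted(range(len(vocab)), key=lambda w: topic_word[topic][w], reverse=True)
    return [vocab[w] for w in order[:n]]


def topics_for_words(words):
    """Rank topics by the total probability they assign to the query words."""
    ids = [vocab.index(w) for w in words]
    score = lambda k: sum(topic_word[k][w] for w in ids)
    return sorted(range(len(topic_word)), key=score, reverse=True)


def docs_for_topic(topic):
    """Rank documents by how much of the given topic they contain."""
    return sorted(doc_topic, key=lambda d: doc_topic[d][topic], reverse=True)


print(top_words(0))                       # most probable words of topic 0
print(topics_for_words(["whale", "sea"])) # topics most related to the query
print(docs_for_topic(1))                  # documents dominated by topic 1
```

Topic clustering and the 2-D visualization are not covered by this sketch.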

