First, switch the VM to secure mode.

Second, replace the username and password in ~/demo/vsm/DownloadVolumes.py with your portal username and password. The HTRC Data API client needs these credentials.

In the VM, open a terminal and change to the htrc-data folder:

Code Block
languagebash
cd ~/demo/htrc-data

List the files in this folder:

Code Block
languagebash
ls

Run the demo script to start the IPython notebook server and the topic modeling analysis:

Code Block
languagebash
./htrc-demo.sh

A browser window opens with a list of notebooks. Click HTRC_vsm_corpus.ipynb.

In the HTRC_vsm_corpus.ipynb notebook, run all cells by clicking "Cell" -> "Run All" in the menu at the top of the page.

After the topic modeling process finishes, you can view the result in the browser, which opens automatically.

The scripts fail with errors if the VM is in maintenance mode.

The demo code in HTRC_vsm_corpus.ipynb takes one HTRC volume and:

  • cleans up the content by handling page headers, line breaks, and hyphens
  • builds a Corpus object, excluding words that occur fewer than 3 times
  • saves the Corpus object for later reuse
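
The cleanup and filtering steps above can be sketched in plain Python. This is only an illustration, not the notebook's actual code: the function names (clean_pages, tokenize, drop_rare_words), the assumption that each page starts with a single header line, and the sample pages are all hypothetical.

```python
import re
from collections import Counter


def clean_pages(pages, header_lines=1):
    """Merge page texts into one string.

    Assumes (hypothetically) that the first `header_lines` lines of each
    page are a running header that should be dropped.
    """
    body_lines = []
    for page in pages:
        body_lines.extend(page.splitlines()[header_lines:])
    text = "\n".join(body_lines)
    # Rejoin words hyphenated across a line break: "mod-\nel" -> "model".
    # Note: this also joins genuine compounds that happen to break at a hyphen.
    text = re.sub(r"(\w)-[ \t]*\n[ \t]*(\w)", r"\1\2", text)
    # Turn the remaining line breaks into single spaces.
    text = re.sub(r"\s*\n\s*", " ", text)
    return text.strip()


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters as words."""
    return re.findall(r"[a-z]+", text.lower())


def drop_rare_words(tokens, min_count=3):
    """Keep only words that occur at least `min_count` times overall."""
    counts = Counter(tokens)
    return [t for t in tokens if counts[t] >= min_count]


if __name__ == "__main__":
    # Made-up sample pages, each with a one-line header.
    pages = [
        "PAGE 1\nthe topic mod-\nel finds the\ntopic",
        "PAGE 2\nthe model",
    ]
    text = clean_pages(pages)
    print(text)  # the topic model finds the topic the model
    print(drop_rare_words(tokenize(text), min_count=3))
```

The default min_count=3 mirrors the "fewer than 3 times" rule in the list above. Real volumes usually need more careful header detection than dropping a fixed number of lines.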

Next, open another IPython notebook, HTRC_vsm_model.ipynb (the list of IPython notebooks is at 127.0.0.1:8888/tree in the VM).

Run all the demo code there by clicking "Cell" -> "Run All".

This demo code:

  • reads in a saved Corpus object
  • loads data from the HTRC Data API
  • builds an LDA model from the corpus
  • saves the trained LDA model
  • views the topics
  • displays topics that relate to a list of words
  • displays documents that are most likely generated by a specific topic
  • clusters topics based on the LDA result
  • visualizes the clustered topics in 2-D
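
To make two of the steps above more concrete, here is a minimal sketch of how "topics that relate to a list of words" and "documents most likely generated by a topic" can be read off a trained LDA model's two probability tables. It does not use the demo's library. The function names and the toy topic-word and document-topic values are invented for illustration, and the demo notebook may rank topics and documents differently.

```python
def topics_for_words(topic_word, vocab, words, top_n=1):
    """Rank topics by the total probability they give to `words`.

    topic_word[k][i] is the probability of vocabulary word i under topic k.
    Words missing from the vocabulary are ignored.
    """
    idx = [vocab.index(w) for w in words if w in vocab]
    score = [sum(row[i] for i in idx) for row in topic_word]
    ranked = sorted(range(len(topic_word)), key=lambda k: score[k], reverse=True)
    return ranked[:top_n]


def docs_for_topic(doc_topic, topic, top_n=2):
    """Rank documents by the share of `topic` in their topic mixture.

    doc_topic[d][k] is the proportion of topic k in document d.
    """
    ranked = sorted(range(len(doc_topic)),
                    key=lambda d: doc_topic[d][topic], reverse=True)
    return ranked[:top_n]


if __name__ == "__main__":
    # Toy values, not output from a real model.
    vocab = ["whale", "ship", "love", "heart"]
    topic_word = [
        [0.50, 0.40, 0.05, 0.05],  # topic 0: sea vocabulary
        [0.05, 0.05, 0.50, 0.40],  # topic 1: romance vocabulary
    ]
    doc_topic = [
        [0.9, 0.1],  # document 0
        [0.2, 0.8],  # document 1
        [0.6, 0.4],  # document 2
    ]
    print(topics_for_words(topic_word, vocab, ["whale", "ship"]))  # [0]
    print(docs_for_topic(doc_topic, topic=0))                      # [0, 2]
```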