Use the IPython interactive interface to fetch volume content, then run vector space modeling and topic modeling on the volumes' OCR content. This use case relies on the inpho/vsm Python package, a textual semantics package developed by Dr. Colin Allen and his team at IU.
This use case can be run only in secure mode in the VM. To export experiment results from the VM, release the result files while in secure mode; the results are then delivered to you by email.
First, switch the VM mode to secure mode.
In the VM, start a Terminal and change directory to the htrc-data folder.
List the files in this folder.
Run the topic modeling analysis.
You will see something like this in the console.
After the topic modeling process finishes, you can view the results in the browser, which opens automatically.
The scripts will fail with errors if the VM is in maintenance mode.
This demo code:
- loads data from HTRC Data API
- builds an LDA model from the corpus
- saves the trained LDA model
- displays the topics
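
The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the demo script itself: it does not use the inpho/vsm API, it substitutes a small collapsed Gibbs sampler written with the standard library, and `load_corpus` returns made-up text standing in for OCR content fetched from the HTRC Data API. The function names, the hyperparameters, and the output file name `lda_model.json` are all hypothetical.

```python
import json
import os
import random
import tempfile


def load_corpus():
    """Stand-in for loading volume text; a real run would fetch OCR via the HTRC Data API."""
    return [
        "whale ship sea captain whale sea",
        "ship sea sailor whale captain",
        "garden flower rose garden bloom",
        "rose flower bloom garden petal",
    ]


def train_lda(docs, n_topics=2, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return the vocabulary and topic-word counts."""
    rng = random.Random(seed)
    tokens = [d.split() for d in docs]
    vocab = sorted({w for doc in tokens for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in tokens]          # topic counts per document
    topic_word = [[0] * n_words for _ in range(n_topics)]  # word counts per topic
    topic_total = [0] * n_topics                           # token counts per topic
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(tokens):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    # Resample each token's topic conditioned on all other assignments.
    for _ in range(n_iter):
        for d, doc in enumerate(tokens):
            for i, w in enumerate(doc):
                wid = word_id[w]
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][wid] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][wid] + beta)
                    / (topic_total[k] + n_words * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][wid] += 1
                topic_total[z] += 1

    return {"vocab": vocab, "topic_word": topic_word}


def top_words(model, n=3):
    """Return the n highest-count words for each topic."""
    result = []
    for counts in model["topic_word"]:
        ranked = sorted(range(len(counts)), key=lambda i: counts[i], reverse=True)
        result.append([model["vocab"][i] for i in ranked[:n]])
    return result


# 1. Load data, 2. build the model, 3. save it, 4. view the topics.
corpus = load_corpus()
model = train_lda(corpus)

model_path = os.path.join(tempfile.mkdtemp(), "lda_model.json")  # hypothetical file name
with open(model_path, "w") as fh:
    json.dump(model, fh)

with open(model_path) as fh:
    reloaded = json.load(fh)

topics = top_words(reloaded)
for k, words in enumerate(topics):
    print(f"Topic {k}: {', '.join(words)}")
```

Saving the counts as JSON keeps the trained model readable outside Python, which may be convenient when result files have to be released from the VM. A dedicated package such as inpho/vsm would handle corpus preparation and model storage in its own way.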