...
The demo code in HTRC_vsm_corpus.ipynb takes one HTRC volume and:
- Cleans up the content by handling page headers, line breaks, and hyphens
- Builds a Corpus object, excluding words that occur fewer than 3 times
- Saves the Corpus object so it can be revisited later
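
To make the cleaning and filtering steps above more concrete, here is a minimal sketch in plain Python. It is not the notebook's actual code: the function names `clean_pages` and `filter_rare` are hypothetical, it assumes the page header is the first line of each page, and a plain list of tokens stands in for the Corpus object.

```python
import re
from collections import Counter


def clean_pages(pages, header_lines=1):
    """Return one cleaned string from a list of page texts.

    Assumption: the first `header_lines` lines of each page are the
    running header and can be dropped.
    """
    body = []
    for page in pages:
        lines = page.splitlines()
        body.extend(lines[header_lines:])
    text = "\n".join(body)
    # Rejoin words hyphenated across a line break: "exam-\nple" -> "example".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse the remaining line breaks and repeated spaces.
    return re.sub(r"\s+", " ", text).strip()


def filter_rare(tokens, min_count=3):
    """Keep only tokens that occur at least `min_count` times."""
    counts = Counter(tokens)
    return [t for t in tokens if counts[t] >= min_count]


# Hypothetical two-page volume, for illustration only.
pages = [
    "HEADER 1\nthe cat sat on the exam-\nple mat",
    "HEADER 2\nthe cat",
]
text = clean_pages(pages)
tokens = filter_rare(text.split())
```

Note that this simple rule also removes the hyphen from genuine compounds that happen to break at a line end, so a real cleanup may need a dictionary check.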
Next, open another IPython notebook, HTRC_vsm_model.ipynb (the list of IPython notebooks can be found at 127.0.0.1:8888/tree in the VM).
...