Miao Chen and Nicholae Cline (both from HTRC) met with Tassie Gniady, Digital Humanities for Cyberinfrastructure Coordinator at Research Technologies at Indiana University, on Dec 17, 2014 for a focused discussion on digital humanities and the HTRC Data Capsule.
The meeting started with Miao introducing and demoing the HTRC Data Capsule. See https://wiki.htrc.illinois.edu/display/COM/HTRC+Data+Capsule+Hands-on+Tutorial for a detailed tutorial. With this tool, digital humanities scholars and scholars in other fields can easily switch between secure mode (for data analysis without leaking data) and maintenance mode (for installing software necessary for the analysis). Tassie was also shown one use case for the HTRC Data Capsule: using an IPython notebook for text analytics on HTRC book volumes with the VSM package developed by Dr. Colin Allen's team.
Tassie provided the suggestions below:
1. The demo IPython notebook needs more explanation of what each command does. The commands may appear obvious to people who use them frequently, but not to first-time users. For example, what does tokenization mean, and what is the .npz output? These should be explained in plain language so that most people can understand them.
2. The log output in the demo IPython notebook is very long. Printing each step and its output helps in learning what the Python program does, but a demo does not need to show all of it. For example, one output cell printed every token the Python program produced, which is unnecessary, especially when the data set contains many volumes. The output can be shortened by hiding most of it and displaying only the first and last several output strings.
3. The .npz file (a matrix output file produced by Python) is confusing to people who are not familiar with it. It should be explained in a comment cell of the IPython notebook.
4. The algorithms used in the demo IPython notebook need to be explained, ideally in plain language that non-experts can understand. For example, the clustering algorithm can be explained in the notebook, or the notebook can point to external resources that explain such topics well.
5. The HTRC book sample in the demo notebook is not well known to a general audience. She recommended using a book that most people know, e.g. one of Shakespeare's works, which would make it much easier to interpret the results and explain how to use the HTRC Data Capsule tool.
6. The Programming Historian (http://programminghistorian.org/) is a good resource for explaining algorithms and can serve as an external source to point to for related topics.
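
As a rough illustration of suggestions 1 to 3, the sketch below shows what a plainly commented notebook cell might look like. It is not taken from the HTRC demo notebook or the VSM package: the `tokenize` and `preview` helpers, the sample sentence, and the array names are all hypothetical, and the tokenization rule is deliberately simplistic. Only NumPy's standard `savez` and `load` functions are assumed.

```python
import io
import re

import numpy as np


def tokenize(text):
    """Tokenization: split raw text into a list of lowercase words."""
    return re.findall(r"[a-z']+", text.lower())


def preview(items, head=3, tail=3):
    """Show only the first and last few items instead of the full list."""
    if len(items) <= head + tail:
        return ", ".join(items)
    omitted = len(items) - head - tail
    return (", ".join(items[:head])
            + f", ... ({omitted} more) ..., "
            + ", ".join(items[-tail:]))


# Hypothetical sample text; a real notebook would read HTRC volumes instead.
text = "To be, or not to be, that is the question"
tokens = tokenize(text)

# Shortened output: a handful of tokens rather than every one of them.
print(preview(tokens))

# Count how often each distinct word occurs.
vocab = sorted(set(tokens))
counts = np.array([tokens.count(word) for word in vocab])

# An .npz file is a NumPy archive: a single file holding one or more
# named arrays. An in-memory buffer stands in for a file on disk here.
buffer = io.BytesIO()
np.savez(buffer, vocab=np.array(vocab), counts=counts)

# Loading the archive gives the same arrays back, looked up by name.
buffer.seek(0)
archive = np.load(buffer)
print(archive.files)
print(dict(zip(archive["vocab"].tolist(), archive["counts"].tolist())))
```

Each step carries a one-line comment in plain language, and the token listing is cut down to its first and last entries, which is the kind of presentation the suggestions above ask for.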