The Trace of Theory (TracT) project looked at the question “Can we find and track theory, especially literary theory, in texts using computers?” We proposed to do this on the large collections of the HathiTrust using a variety of techniques with the support of the HathiTrust Research Center. This project brought together researchers who are part of the Text Mining the Novel project (http://novel-tm.ca/) led by Dr. Andrew Piper at McGill University.
The project takes a two-step approach to tracking theory through its textual traces.
1. Subsetting: We propose to experiment with two methods for identifying “theoretical” subsets of texts from large collections like the Google-digitized dataset (GDD) of the HathiTrust. The goal would be to identify subsets of the full GDD that are theoretical in different ways.
2. Mining: We would then experiment with large-scale text-mining and clustering methods on these subsets. In particular we propose to try topic modelling and other forms of clustering.
The final project report can be found at https://docs.google.com/document/d/1BwWd_tR6TtA7kp6QYQuAQte88Ri4Vvcx9Bho7NTKQ6o/edit?ts=5665d43e#. Please refer to the report for project background, technical details, and community impact.
Geoffrey Rockwell (Univ of Alberta), Laura Mandell (Texas A&M Univ), Stefan Sinclair (McGill Univ), Matthew Wilkens (Notre Dame), Susan Brown (Univ of Guelph)
Boris Capitanu (HTRC), Kahyun Choi (HTRC)
We extracted philosophical and literary critical texts by using a list of keywords. A Python script was developed to calculate the relative frequency of each keyword in a text and to repeat this over a collection. The process also calculates the sum of the relative frequencies, giving us a simple measurement of the use of philosophical keywords in a text. We tested this approach on the Project Gutenberg and HathiTrust collections.
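The keyword measurement described above can be sketched roughly as follows. This is a minimal illustration, not the project's script: the keyword list, the tokenizer, and the function names `keyword_profile` and `rank_collection` are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical keyword list, for illustration only; the project's
# actual list is not reproduced here.
KEYWORDS = ["reason", "truth", "knowledge", "being"]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def keyword_profile(text, keywords=KEYWORDS):
    """Return (relative frequency per keyword, sum of those frequencies)."""
    tokens = tokenize(text)
    total = len(tokens)
    if total == 0:
        # An empty text has no keyword use at all.
        return {k: 0.0 for k in keywords}, 0.0
    counts = Counter(tokens)
    rel = {k: counts[k] / total for k in keywords}
    return rel, sum(rel.values())


def rank_collection(texts, keywords=KEYWORDS):
    """Score each text in a {title: text} mapping; highest score first."""
    scores = {
        title: keyword_profile(body, keywords)[1]
        for title, body in texts.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Calling `rank_collection` on a mapping of titles to full texts gives a ranked list from which a cutoff could select the candidate "theoretical" subset; where to place that cutoff is a judgment call that this sketch does not address.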
2. Machine learning to identify subsets.
In addition to keyword list approaches we also tried machine learning approaches for identifying subsets.
We started from a training set of the 20 philosophical texts and 20 non-philosophical texts from Project Gutenberg mentioned above and applied several supervised algorithms. The results for the different algorithms are shown below.
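As a rough sketch of what such a supervised step can look like, the example below trains a multinomial naive Bayes classifier on labelled texts. The choice of naive Bayes, the class name `NaiveBayes`, and the label strings are assumptions for illustration; this summary does not say which algorithms the project applied.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        # The vocabulary is every word seen in any training text.
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        return self

    def predict(self, text):
        # Words never seen in training carry no evidence, so skip them.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        n_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for c in self.classes:
            total = sum(self.word_counts[c].values())
            score = math.log(self.doc_counts[c] / n_docs)  # class prior
            for t in tokens:
                smoothed = (self.word_counts[c][t] + 1) / (total + len(self.vocab))
                score += math.log(smoothed)
            if score > best_score:
                best_label, best_score = c, score
        return best_label
```

In practice the training texts would be the full Project Gutenberg volumes, and accuracy would be estimated on held-out texts rather than on the training set itself.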
b) unsupervised. We also used unsupervised methods to identify subgenres within a broadly philosophical corpus. Our approach mixed token unigram features with metadata and formal features in ways that may be portable to other text clustering and classification tasks. This work also demonstrates the practical use of the HathiTrust Research Center’s Extracted Feature dataset; interactive visualizations with Bokeh in Python; dimensionality reduction in complex datasets; and multiple algorithms for clustering tasks. Example visualizations are shown below.
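To make the clustering idea concrete, here is a minimal k-means sketch over token unigram relative frequencies. It is an assumption-laden illustration: the project used several clustering algorithms together with metadata and formal features, and the helper names `unigram_vector` and `kmeans`, the fixed vocabulary, and the seeding rule below are hypothetical, not taken from the project's notebook.

```python
import re
from collections import Counter


def unigram_vector(text, vocab):
    """Relative frequency of each vocabulary word in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero for empty texts
    return [counts[w] / total for w in vocab]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20):
    """Plain k-means; the first k vectors seed the centroids so that
    repeated runs give the same result."""
    centroids = [list(v) for v in vectors[:k]]
    assignments = [0] * len(vectors)
    for _ in range(iterations):
        # Assign every vector to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments
```

Seeding from the first k vectors keeps the sketch deterministic; a real run would more likely use randomized or k-means++ style seeding and reduce dimensionality before clustering.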
3. Adapting the Galaxy Viewer.
Once we had viable ways of identifying subsets, we worked on adapting a visual tool for exploring the subset, called the Galaxy Viewer. This was originally developed by John Montague and Ryan Chartier. For this project it was adapted for use on the HathiTrust collections by Ryan Chartier and Boris Capitanu. The general idea was to run topic modelling (Mallet) on the results of the keyword (or later machine learning) processes. The results are then presented in the Galaxy Viewer, which lets you explore a “galaxy” of topics (see the figure below on the left). Clicking on a topic shows the words contributing significantly to that topic. Clicking on a document title now opens the HathiTrust reader so you can see the full text (see the figure below on the right).
Figure. Galaxy Viewer Exploring Literary Criticism Subset.
Figure. HathiTrust Reader with the Robert Browning Text Seen in Galaxy Viewer.
https://github.com/htrc/ACS-TT/blob/master/tools/notebooks/ClassifyingPhilosophicalText.ipynb (Classifying philosophical texts)
http://nbviewer.ipython.org/github/htrc/ACS-TT/blob/master/tools/notebooks/Unsupervised%20Clustering%20Philosophy.ipynb (Unsupervised classification of philosophical genres)