The Trace of Theory (TracT) project looked at the question “Can we find and track theory, especially literary theory, in texts using computers?” We proposed to do this on the large collections of the HathiTrust using a variety of techniques with the support of the HathiTrust Research Centre. This project brought together researchers who are part of the Text Mining the Novel project (http://novel-tm.ca/) led by Dr. Andrew Piper at McGill University.
It takes a two-step approach to trying to track theory through its textual traces.
1. Subsetting: We propose to experiment with two methods for identifying “theoretical” subsets of texts from large collections like the Google-digitized dataset (GDD) of the HathiTrust. The goal would be to identify subsets of the full GDD that are theoretical in different ways.
2. Mining: We would then experiment with large-scale text-mining and clustering methods on these subsets. In particular we propose to try topic modelling and other forms of clustering.
Final project can be found at https://docs.google.com/document/d/1BwWd_tR6TtA7kp6QYQuAQte88Ri4Vvcx9Bho7NTKQ6o/edit?ts=5665d43e# please refer to the report for project background, technical details, and community impact.
Geoffrey Rockwell (Univ of Alberta), Laura Mandell (Texas A&M Univ), Stefan Sinclair (McGill Univ), Matthew Wilkens (Notre Dame), Susan Brown (Univ of Guelph)
Boris Capitanu (HTRC), Kahyun Choi (HTRC)
- Using keyword lists to identify philosophical and literary critical texts
We extracted philosophical and literary critical texts by using list of keywords. A python script was developed to calculate the relative frequency of each word in a text and do it over a collection. The process also calculates the sum of the relative frequencies giving us a simple measurement of the use of philosophical keywords in a text. We tested this approach on the Project Gutenberg and HathiTrust collections.
2. Machine learning to identify subsets.
In addition to keyword list approaches we also tried machine learning approaches for identifying subsets.
We started from a training sets with the 20 philosophical texts and 20 non-philosophical texts from the Project Gutenberg mentioned above and applied a list of supervised algorithms. See this iPython notebook for details. Below shows results using different algorithms.
b) unsupervised. We also used unsupervised methods to identify subgenres within a broadly philosophical corpus. Our approach mixed token unigram features with metadata and formal features in ways that may be portable to other text clustering and classification tasks. This work also demonstrates the practical use of the HathiTrust Research Center’s Extracted Feature dataset; interactive visualizations with Bokeh in Python; dimensionality reduction in complex datasets; and multiple algorithms for clustering tasks. See this iPython notebook for details. Example visualizations are shown below.
3. Adapting the Galaxy Viewer.
Once we had viable ways of identifying subsets we worked on adapting a visual tool for exploring the subset called the Galaxy Viewer. This was originally developed by John Montague and Ryan Chartier. For this project it was for use on the HathiTrust collections by Ryan Chartier and Boris Capitanu. The general idea was to run topic modelling (Mallet) on the results of the keyword (or later machine learning) processes. The results are then presented in the Galaxy Viewer which lets you explore a “galaxy” of topics (See the figure below on the left). You click on a topic and it shows the words contributing significantly to that topic. Clicking on a document title now opens the HathiTrust reader so you can see the full text (see the figure below on the right).
Figure. Galaxy Viewer Exploring Literary Criticism Subset. Figure. HathiTrust Reader with the Robert Browning Text Seen in Galaxy Viewer
Some of the insights derived from this work include the following, some of which is represented in the graph below (the Y axis represents annual means of philosophical classification scores (each text is scored from -1 to 1).
the HTRC Genre corpus has a lot of duplicate texts
the HTRC Genre corpus increases the number of volumes per year over time
there's an issue with the HTRC Genre corpus around the end of the 19th century
philosophical variation seems to increase over time
drama is the least philosophical genre
fiction and poetry seem to get less philosophical over time
From appearance to the naked eye, using the literary critical keywords seemed to improve the results. For more information about these test, please see: https://github.com/htrc/ACS-TT/blob/master/tools/notebooks/ClassifyingLitCrit.ipynb
Two conference panels were submitted, and one already been accepted:
CSDH (Canadian Society for Digital Humanties: Calgary May-June 2016) Panel proposal on “On the Track of Literary Theory and Philosophy: Explorations of the HathiTrust Collections”. (accepted)
- DH 2016 (Digital Humanities: Krakow July 2016) Panel proposal on “The Trace of Theory: Extracting Subsets from Large Collections” which includes presentations by HTRC staff. (pending)
The potential impact of this work is enourmous in the humanities. Few researchers have the ability to work with very large datasets like those of the HathiTrust. The first problem any humanist has is how to extract usable subsets in a fashion they understand. In particular the problem is extracting subsets where there is no metadata and where the criteria for selection is academic as in “literary critical works.” That is what academics want to explore - subsets that fit their interests. This project has shown that lists of keywords can produce viable subsets for further exploration. By viable we mean subsets that are not so large as to be meaningless, but still containing surprising titles beyond what one would have expected. This project has also shown that machine learning approaches can also help create subsets. We believe that a combination of keyword and machine learning techniques can be perfected that would make it easy for textual scholars to control the subsetting of very large corpora.
The second potential impact is adapting a visual exploration environment like the Galaxy Viewer (GV) to the the exploration of large subsets. Even with successful subsetting techniques, users get too many results to many skim. The GV gives humanists a viable way to explore results using topic modelling. It also allows humanists to then drill down to the actual texts as a way of checking results against the texts themselves. The challenge now is to refine this interface, compare it to alternatives, and test a robust version with a larger set of users. We believe that the GV could become part of a research interface to the open (and closed) HathiTrust collections that would make them accessible to a broad research audience.
https://github.com/htrc/ACS-TT/blob/master/tools/notebooks/ClassifyingPhilosophicalText.ipynb (Supervised learning: Classifying philosophical texts)
http://nbviewer.ipython.org/github/htrc/ACS-TT/blob/master/tools/notebooks/Unsupervised%20Clustering%20Philosophy.ipynb (Unsupervised learning: Usupervised classification of philosophical genres)