Overview

The Trace of Theory (TracT) project asked the question “Can we find and track theory, especially literary theory, in texts using computers?” We proposed to do this on the large collections of the HathiTrust using a variety of techniques with the support of the HathiTrust Research Center (HTRC). This project brought together researchers who are part of the Text Mining the Novel project (http://novel-tm.ca/) led by Dr. Andrew Piper at McGill University.

...

Please refer to the final project report for project background, technical details, and community impact: https://docs.google.com/document/d/1BwWd_tR6TtA7kp6QYQuAQte88Ri4Vvcx9Bho7NTKQ6o/edit?ts=5665d43e#

Personnel

Geoffrey Rockwell (Univ of Alberta), Laura Mandell (Texas A&M Univ), Stéfan Sinclair (McGill Univ), Matthew Wilkens (Notre Dame), Susan Brown (Univ of Guelph)

Boris Capitanu (HTRC), Kahyun Choi (HTRC)

Workflow

  1. Using keyword lists to identify philosophical and literary critical texts

We extracted philosophical and literary critical texts using lists of keywords. A Python script was developed to calculate the relative frequency of each keyword in a text and to repeat this over a collection. The process also calculates the sum of the relative frequencies, giving us a simple measurement of the use of philosophical keywords in a text. We tested this approach on the Project Gutenberg and HathiTrust collections.
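The keyword scoring described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the function names, the tokenization rule (lowercased runs of letters), and the sample keyword list are all assumptions made for the example, and keywords are assumed to be single lowercase words.

```python
import re
from collections import Counter

# Hypothetical keyword list for illustration; the project's real lists are not shown here.
SAMPLE_KEYWORDS = ["reason", "truth", "knowledge"]


def keyword_relative_frequencies(text, keywords):
    """Return ({keyword: relative frequency}, sum of those frequencies) for one text."""
    # Assumed tokenization: lowercase the text and keep runs of ASCII letters.
    tokens = re.findall(r"[a-z]+", text.lower())
    total = len(tokens)
    counts = Counter(tokens)
    # Relative frequency = keyword occurrences / total tokens (0.0 for an empty text).
    freqs = {kw: (counts[kw] / total if total else 0.0) for kw in keywords}
    return freqs, sum(freqs.values())


def score_collection(texts, keywords):
    """Score every text in a {title: text} mapping; highest keyword score first."""
    scores = {
        title: keyword_relative_frequencies(body, keywords)[1]
        for title, body in texts.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Calling `score_collection` on a dictionary of titles and full texts returns the titles ranked by their summed keyword frequency, which is one simple way to surface candidate philosophical texts for closer inspection.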

  2. Machine learning to identify subsets

...

  Figure. Galaxy Viewer Exploring Literary Criticism Subset.

  Figure. HathiTrust Reader with the Robert Browning Text Seen in Galaxy Viewer.

Findings

Insights derived from this work include the following, some of which are represented in the graph below. The Y axis represents the annual means of the philosophical classification scores; each text is scored from -1 to 1.

  • the HTRC Genre corpus has a lot of duplicate texts

  • the HTRC Genre corpus increases the number of volumes per year over time

  • there's an issue with the HTRC Genre corpus around the end of the 19th century

  • philosophical variation seems to increase over time

  • drama is the least philosophical genre

  • fiction and poetry seem to get less philosophical over time
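The annual means mentioned above could be computed along the following lines. This is only a sketch under stated assumptions: the record layout `(year, genre, score)`, the function name, and the grouping by genre are hypothetical choices for illustration, not the project's actual data format.

```python
from collections import defaultdict
from statistics import mean


def annual_mean_scores(records):
    """Average classification scores per (genre, year).

    `records` is an iterable of (year, genre, score) tuples, where each
    score is assumed to lie in the range -1 to 1.
    """
    grouped = defaultdict(list)
    for year, genre, score in records:
        if not -1.0 <= score <= 1.0:
            raise ValueError(f"score out of range [-1, 1]: {score}")
        grouped[(genre, year)].append(score)
    # Sort by genre, then year, so the result is ready to plot as one line per genre.
    return {key: mean(scores) for key, scores in sorted(grouped.items())}
```

Duplicate volumes, which the corpus is noted to contain, would weight a year's mean toward the duplicated text, so deduplicating the records before averaging may be worth considering.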

[Figure: graph of annual mean philosophical classification scores]

Judging by eye, using the literary critical keywords seemed to improve the results. For more information about these tests, please see: https://github.com/htrc/ACS-TT/blob/master/tools/notebooks/ClassifyingLitCrit.ipynb

Community Impact

Two conference panel proposals were submitted, and one has already been accepted:

  • CSDH (Canadian Society for Digital Humanities: Calgary May-June 2016) Panel proposal on “On the Track of Literary Theory and Philosophy: Explorations of the HathiTrust Collections”. (accepted)

  • DH 2016 (Digital Humanities: Krakow July 2016) Panel proposal on “The Trace of Theory: Extracting Subsets from Large Collections” which includes presentations by HTRC staff. (pending)

...

The second potential impact is adapting a visual exploration environment like the Galaxy Viewer (GV) to the exploration of large subsets. Even with successful subsetting techniques, users get too many results to skim. The GV gives humanists a viable way to explore results using topic modelling. It also allows humanists to drill down to the actual texts as a way of checking results against the texts themselves. The challenge now is to refine this interface, compare it to alternatives, and test a robust version with a larger set of users. We believe that the GV could become part of a research interface to the open (and closed) HathiTrust collections that would make them accessible to a broad research audience.

Resources

Geoffrey Rockwell, Stéfan Sinclair, Laura Mandell, Susan Brown, and Matthew Wilkens. Project final report. https://docs.google.com/document/d/1BwWd_tR6TtA7kp6QYQuAQte88Ri4Vvcx9Bho7NTKQ6o/edit?ts=5665d43e#

https://github.com/htrc/ACS-TT/blob/master/tools/notebooks/ClassifyingPhilosophicalText.ipynb (Supervised learning: Classifying philosophical texts)

...