Before running an algorithm for text analysis, it’s important to understand the data being studied. For traditional research that relies on a small number of books, this step is covered by reading the material directly. For computational analysis, where the focus is often sets of hundreds or thousands of books (what HTRC calls “worksets”), this is not achieved so easily. As a result, exploring and understanding our data becomes an explicit step in the research process. This is often the first step in computational analysis, but it can present a number of challenges.
For instance, if we were interested in studying poetry through traditional means, we could simply visit our local library, go to the poetry section, and start reading. If we wanted to see whether volumes not shelved in this section contain poetry, we could open them and see if we find poetry or prose. But when working with hundreds or thousands of digital items, we cannot practically do either of these things. Some volumes we want to study may be under copyright, which means we can’t read them without purchasing them. Others may have incorrect metadata records attached to them, which means we may not even know to look for them as poetry. Perhaps most challenging of all, it would take lifetimes to manually read thousands of books.
This is where the Extracted Features 2.0 Dataset can come in handy. Since the dataset is non-consumptive, we can use the data without restriction, even to study volumes we cannot read due to copyright. Additionally, the dataset is heavily structured, removing the need for users to replicate common data analysis tasks like part-of-speech tagging or page section separation. This structure also lets us explore thousands of volumes relatively quickly.
Returning to our first example: suppose we are scholars interested in studying large-scale trends in poetry. Our first step would be to assemble a workset of poetry in order to answer our research questions. With a massive digital library like HathiTrust, finding all of the items we are interested in is no easy task. However, we can use the EF 2.0 Dataset to assess whether texts that we know are correctly classified as poetry share structural characteristics, and potentially develop a generalized method of identifying poetry based on these characteristics. We’ll be doing exactly this using two unidentified volumes by Ursula K. Le Guin, who published both prose and poetry.
To start, we'll import Pandas, a very common data science library in Python, and the HTRC FeatureReader, a library written by HTRC to ease the use of the Extracted Features Dataset:
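A minimal sketch of these imports, assuming both libraries are installed under the PyPI names `pandas` and `htrc-feature-reader` (the latter provides the `htrc_features` module). The `try`/`except` guard and the `libraries_ready` flag are additions for illustration, not part of either library:

```python
# Import pandas and the FeatureReader's Volume class.
# The guard and the libraries_ready flag are illustrative additions.
try:
    import pandas as pd
    from htrc_features import Volume
    libraries_ready = True
except ImportError as err:
    libraries_ready = False
    print(f"Missing dependency ({err.name}); try: pip install pandas htrc-feature-reader")
```

If the message prints, installing the named packages and re-running the cell should make both `pd` and `Volume` available.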
Next, we'll use each volume’s HathiTrust ID, its permanent, unique identifier in HathiTrust, to download its EF file and create a FeatureReader Volume object, which lets us take advantage of the volume-level data in the file. A Volume in FeatureReader has specific properties and built-in methods that let us do a number of common tasks. We can create Volumes simply by using Volume() and adding the HathiTrust ID (HTID) in the parentheses:
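The step above might be sketched as follows. The HTIDs here are hypothetical placeholders, not the identifiers of the two Le Guin volumes, and `load_volumes` is a helper name introduced only for this example. The sketch assumes, as described above, that passing an HTID to `Volume()` retrieves the corresponding EF file:

```python
# Hypothetical placeholders: replace with the HTIDs of the volumes to study.
HTIDS = ["<htid-of-first-volume>", "<htid-of-second-volume>"]


def load_volumes(htids):
    """Create one FeatureReader Volume per HathiTrust ID (illustrative helper)."""
    # Imported here so that defining the helper does not require the library.
    from htrc_features import Volume

    # Volume(htid) is the call described in the text; it needs network access
    # to fetch each volume's Extracted Features file.
    return [Volume(htid) for htid in htids]


# Once real HTIDs are filled in:
# vol1, vol2 = load_volumes(HTIDS)
```

Keeping the IDs in a list makes it easy to extend the same step to a larger workset later.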