The world's great research libraries have, over time, carefully assembled a rich body of metadata pertaining to the books in their collections. Because the HTRC has access to volume-level metadata as well as volume-level content, we have constructed a Bookworm of the HathiTrust corpus to give scholarly researchers a means of exploring trends in that corpus. John Unsworth has noted that a fundamental goal of the humanities is appreciation: "by paying attention to an object of interest, we can explore it, find new dimensions within it, notice things about it that have never been noticed before, and increase its value" (2004). Shifting from traditional close reading to a large-scale view of text can cause profound discomfort for humanities scholars, because it is difficult to retain the same sensitivity to what is actually contained in the works being studied. HTRC-Bookworm will function as a link between quantitative analysis (distant reading) and close reading. According to Frederick Gibbs and Daniel Cohen, "any robust digital research methodology must allow the scholar to move easily between distant and close reading, between the bird's eye view and the ground level of the texts themselves" (2011). This is what HTRC-Bookworm intends to accomplish, within the limitations of applicable copyright laws.

Using HT+BW

HathiTrust+Bookworm (HT+BW) is available from the Explore page on HTRC Analytics. From that page, you can also find links to experimental "advanced" Bookworm interfaces.

Use HathiTrust+Bookworm
Follow a tutorial: HathiTrust+Bookworm step-by-step tutorial

The data

HT+BW runs on top of HTRC Extracted Features data and represents a snapshot of the HathiTrust Digital Library taken when it held 13.7 million volumes. It performs best with modern English and other European languages; because of the way the input data was parsed to create the Extracted Features, it does not perform as well on non-Latin characters. As in the HathiTrust Digital Library itself, some volumes are duplicated in the data. The OCR has not been corrected, and search is less accurate for volumes published before the nineteenth century.