HathiTrust+Bookworm (HT+BW) visualizes word trends in 13.7 million works held by HathiTrust. It is currently being developed under a two-year NEH Implementation Grant, but it can already be tried at https://bookworm.htrc.illinois.edu/develop/. It enables scholars to discover new textual use patterns across the entire corpus, including in-copyright and public domain volumes.

The world's great research libraries have, over time, carefully assembled a rich body of metadata pertaining to the books in their collections. Since the HTRC has access to volume-level metadata as well as volume-level content, we have constructed a Bookworm of the HathiTrust corpus in order to provide scholarly researchers with the means of exploring trends.

Goals

This tool enables scholars to discover new textual use patterns across the entire corpus. In the future, we plan to ingest the entire HathiTrust corpus and continue to identify appropriate metadata to use for the faceted browsing. This tool will be particularly useful to scholars interested in books that are still under copyright, which is the case for most books published after 1923. Although these books will not be available for reading or downloading online, working with individual words and phrases and tracking their occurrences through time will be useful to academic researchers, especially historians, sociologists and literary scholars.

John Unsworth has noted that a fundamental goal of the humanities is appreciation: "by paying attention to an object of interest, we can explore it, find new dimensions within it, notice things about it that have never been noticed before, and increase its value" (2004). Shifting from traditional close reading to a large-scale view of text can be profoundly uncomfortable for humanities scholars, because it is difficult to retain the same sensitivity to what the works being studied actually contain. HTRC-Bookworm will function as a link between quantitative analysis (distant reading) and close reading. According to Frederick Gibbs and Daniel Cohen, "any robust digital research methodology must allow the scholar to move easily between distant and close reading, between the bird's eye view and the ground level of the texts themselves" (2011). This is what HTRC-Bookworm intends to accomplish, within the limitations of applicable copyright laws.

Using HT+BW

The data

HT+BW runs on top of HTRC Extracted Features data and represents a snapshot of the HathiTrust Digital Library at the point when it held 13.7 million volumes. It performs best with modern English or European languages; because of the way the input data was parsed to create the Extracted Features, it does not perform as well on non-Latin characters. As with the HathiTrust Digital Library, there are duplicate volumes in the data. The OCR has not been corrected, and search is less accurate for volumes published prior to the nineteenth century.

The metadata used to facet the search is pulled from the bibliographic metadata for the volumes. Some metadata fields are not universal for volumes in HathiTrust. For example, not every volume has a Library of Congress classification. Faceting based on those non-universal criteria will limit search to only volumes where that metadata field applies. 

HT+BW supports:

  • Single word (unigram) searches
  • String searches - no lemmatization or standardization of the tokens (i.e. words) has been imposed

Use the plus and minus icons to add or remove search fields. Each field will perform a separate search, with each search term appearing as a line on the graph. 



Facet

The facet fields display the options for faceted search. Initially, clicking "All texts" pops up a menu that displays the possible facet fields. After a selection is made, you can modify it again by clicking on that label. Clicking "All", or inside the box for a given facet, pops up a scrolling list of values that can be selected for that facet. Multiple values can be chosen for a facet field by clicking inside this box; multiple values in one facet field are combined with a logical OR. So, for instance, in the example below, the search will be for "Genre: Biography OR Poetry". Different facet fields are in an implicit AND relationship with each other. In the figure below, three facet fields have been specified, and the search criteria translate into "(Class: unknown) AND (Genre: Biography OR Poetry) AND (Format: Book)".

[Image: example facet selection with Class, Genre and Format specified]
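The OR-within-a-field, AND-across-fields rule can be illustrated with a minimal Python sketch. This is not HT+BW's actual implementation: the function name `matches_facets`, the dictionary layout, and the assumption that each volume carries a single value per metadata field are all hypothetical and chosen only for illustration.

```python
def matches_facets(volume_meta, facets):
    """Return True if a volume satisfies every selected facet field.

    volume_meta: dict mapping a field name to the volume's value (hypothetical layout).
    facets: dict mapping a field name to the set of values selected for that field.
    """
    # Values inside one field are alternatives (OR); all fields must match (AND).
    # A volume lacking a selected field yields None and is therefore excluded.
    return all(
        volume_meta.get(field) in accepted
        for field, accepted in facets.items()
    )


# The selection described in the text above.
selection = {
    "Class": {"unknown"},
    "Genre": {"Biography", "Poetry"},
    "Format": {"Book"},
}

poetry_book = {"Class": "unknown", "Genre": "Poetry", "Format": "Book"}
fiction_book = {"Class": "unknown", "Genre": "Fiction", "Format": "Book"}
no_genre = {"Class": "unknown", "Format": "Book"}

print(matches_facets(poetry_book, selection))   # True
print(matches_facets(fiction_book, selection))  # False
print(matches_facets(no_genre, selection))      # False
```

The last case mirrors the note in "The data": faceting on a field that a volume does not have removes that volume from the search.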

Adjust Settings

Clicking on the Settings button, which looks like a gear, will display a menu for changing the graph options: Time, Metric, Case and Smoothing.


Time

The time range can be adjusted by dragging the begin point and end point of the "Time" slider to the left or right as necessary.

Metric

This selection allows you to choose how the numerical values are counted. Depending on what option you choose, the label of the y-axis of the graph is changed accordingly, and the chart values adjusted. The four options available are as follows:

  • % of words shows how frequently the unigram is used relative to all other tokens in the corpus, for the given year shown. This lets you see how often a word is used relative to the size of the corpus, without having to worry about things like whether there are more books in 1850 than 1900.
  • % of texts gives the number of texts that use your search terms at least once as a proportion of the total number of texts published that year. Unlike "% of words," it will not be skewed by a single book that uses a word hundreds of times; however, it may be impacted by changing sizes.
  • Word count plots the actual count of the searched word as the y-value for the plot.
  • Text count plots the number of volumes in which the searched word actually occurs. Each such volume registers as a single count, no matter how often it uses the word. (The word "text" is being used interchangeably with the word "volume".)
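To make the difference between the four metrics concrete, here is a minimal Python sketch that computes them for one word and one year over a toy corpus. The `Volume` record, the `metrics` function, and the three tiny volumes are hypothetical and exist only for illustration; they are not HT+BW's data model or code.

```python
from collections import namedtuple

# Hypothetical record: a publication year and the volume's list of tokens.
Volume = namedtuple("Volume", ["year", "tokens"])


def metrics(volumes, word, year):
    """Compute the four y-axis metrics for `word` among volumes from `year`."""
    in_year = [v for v in volumes if v.year == year]
    word_count = sum(v.tokens.count(word) for v in in_year)   # every occurrence
    text_count = sum(1 for v in in_year if word in v.tokens)  # one per volume
    total_words = sum(len(v.tokens) for v in in_year)
    total_texts = len(in_year)
    return {
        "word_count": word_count,
        "text_count": text_count,
        "pct_words": 100.0 * word_count / total_words if total_words else 0.0,
        "pct_texts": 100.0 * text_count / total_texts if total_texts else 0.0,
    }


toy_corpus = [
    Volume(1900, ["soda", "soda", "water", "pop"]),
    Volume(1900, ["pop", "tea", "tea", "tea"]),
    Volume(1901, ["soda"]),
]

result = metrics(toy_corpus, "soda", 1900)
print(result)
```

In this toy corpus, "soda" occurs twice in 1900 but in only one of the two volumes from that year, so the word-based and text-based metrics diverge: 25% of words versus 50% of texts.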

Case

  • Insensitive ignores the distinction between lowercase and uppercase characters when counting words 
  • Sensitive maintains the distinction between lowercase and uppercase.
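The effect of the Case setting can be sketched in a few lines of Python. The sketch assumes that case-insensitive matching is done by lowercasing both the tokens and the search term; how HT+BW folds case internally is not stated here, and the `count_word` helper is hypothetical.

```python
from collections import Counter


def count_word(tokens, word, case_sensitive=True):
    """Count occurrences of `word`, optionally ignoring letter case."""
    if case_sensitive:
        return Counter(tokens)[word]
    # Assumption: case is folded by lowercasing both sides.
    return Counter(t.lower() for t in tokens)[word.lower()]


tokens = ["Pop", "pop", "POP", "soda"]
print(count_word(tokens, "pop"))                        # 1
print(count_word(tokens, "pop", case_sensitive=False))  # 3
```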

Smoothing

Smoothing is a means to create a moving average over the data and to identify overall trends by removing jagged and discontinuous data points.  Often trends become more apparent when data is viewed as a moving average. Smoothing windows are weighted: the year shown is weighted the most heavily, and the weights decrease in each direction until the smoothing span is reached. Smoothing options are described below:

  • To see the raw data points, set smoothing to 0. 
  • To average one point on each side of a data point, set smoothing to 1, which sums the previous, current, and next values and divides that sum by 3.
  • A smoothing setting of 5 means that 11 values will be averaged, 5 values on each side of the data point.
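The smoothing settings can be sketched as a centered moving average in Python. The exact weights HT+BW applies are not given here, so this is an illustration under stated assumptions rather than the tool's implementation: by default the sketch uses the plain average described in the bullets, the optional triangular weights (heaviest on the year shown, decreasing with distance) are one possible weighting, and shrinking the window at the ends of the series is also an assumption. The `smooth` function is hypothetical.

```python
def smooth(values, span, weighted=False):
    """Centered moving average with `span` points on each side of each value."""
    smoothed = []
    for i in range(len(values)):
        total = 0.0
        weight_sum = 0.0
        for offset in range(-span, span + 1):
            j = i + offset
            if 0 <= j < len(values):  # assumption: the window shrinks at the edges
                # Uniform weights by default; triangular weights are an assumption.
                w = (span + 1 - abs(offset)) if weighted else 1
                total += w * values[j]
                weight_sum += w
        smoothed.append(total / weight_sum)
    return smoothed


yearly = [3.0, 6.0, 9.0, 6.0, 3.0]

print(smooth(yearly, 0))                 # raw data points, unchanged
print(smooth(yearly, 1))                 # 3-value average around each point
print(smooth(yearly, 1, weighted=True))  # center value counts double
```

With a span of 1, the middle point becomes (6 + 9 + 6) / 3 = 7.0 under uniform weights and (6 + 2 × 9 + 6) / 4 = 7.5 under the triangular weights, which shows how weighting keeps the smoothed curve closer to the year being displayed.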

Access volumes in the HathiTrust Digital Library from HT+Bookworm

You can click on any point in the plot to see a list of volumes, ordered by decreasing contribution to the plot at that particular year. Each volume title is a hyperlink that takes you to the corresponding volume in the HathiTrust Digital Library.

Examples

Soda vs Pop

Soda or pop? How do Americans in different places refer to their soft drinks? Besides a variety of scientific papers and journal articles arguing about it, the Pop vs Soda project plots the regional variations in the use of the terms "pop" and "soda" to describe soft drinks. Current statistics from the project are available at http://popvssoda.com/statistics/USA.html. Their statistics and mappings are interesting to read. However, what if we want to look back into history and find the hidden statistics? Where can we get historical evidence for our question? What will the results look like if the statistics are based on published works rather than on online votes? Try Bookworm! With data extracted from millions of publications from 1940 to 2015, we visualized the "Soda to Pop Ratio" by state. The y-axis represents the state of publication, while the x-axis shows the ratio of "soda" to "pop". For example, we can see from the graph that publications in Massachusetts use "soda" for soft drinks almost ten times as often as "pop".
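The ratio plotted on the x-axis is simply the count of "soda" divided by the count of "pop" for each state. A minimal Python sketch follows; the function name, the state labels, and the counts are placeholders invented for illustration and are not results from HT+BW.

```python
def soda_to_pop_ratio(counts_by_state):
    """Map each state to count("soda") / count("pop")."""
    ratios = {}
    for state, (soda, pop) in counts_by_state.items():
        # Skip states with no "pop" occurrences to avoid dividing by zero.
        if pop > 0:
            ratios[state] = soda / pop
    return ratios


# Placeholder counts, NOT real data: (occurrences of "soda", occurrences of "pop").
placeholder_counts = {
    "State A": (500, 50),
    "State B": (30, 120),
    "State C": (10, 0),
}

ratios = soda_to_pop_ratio(placeholder_counts)
print(ratios)
```

A ratio above 1 means "soda" is the more frequent term in that state's publications; a ratio below 1 means "pop" is.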

...

Word Popularity of 4 countries after 1945

 

References

Michel, Jean-Baptiste, Shen, Y. K., Aiden, A. P., Veres, A., Gray, M. K., Pickett, J. P., ... & Aiden, E. L. "Quantitative analysis of culture using millions of digitized books." Science 331.6014 (2011): pp. 176-182.

...