Researcher Ted Underwood created a dataset that lists word frequencies and helps identify genre within a subset of HathiTrust (HT) volumes. Read about all the details of this dataset here.
Genre-specific word counts for 178,381 volumes from the HathiTrust Digital Library [v.0.1]
Links to specific files
For each genre, we provide a metadata file, a corrections file, and a yearly summary, as well as tar.gz files that aggregate individual volume-level wordcount files, sorted by estimated date of publication. Metadata files and yearly summaries are small (less than 30 MB); some of the tar.gz files are larger (up to 550 MB).
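As a rough illustration of working with one of the tar.gz archives, the Python sketch below sums word counts across every file in an archive without unpacking it to disk. The internal layout of the volume-level files is not described here, so the tab-separated "word, count" line format, the file names, and the helper names `sum_word_counts` and `build_demo_archive` are all assumptions made for the example; check the dataset documentation for the actual format before relying on it.

```python
import io
import tarfile
from collections import Counter


def sum_word_counts(archive, delimiter="\t"):
    """Add up word counts across every regular file in a tar.gz archive.

    `archive` is a binary file object. ASSUMPTION: each member file holds
    one "word<delimiter>count" pair per line; the real files may differ.
    """
    totals = Counter()
    with tarfile.open(fileobj=archive, mode="r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories and other non-file entries
            handle = tar.extractfile(member)
            for raw_line in handle:
                parts = raw_line.decode("utf-8").rstrip("\r\n").split(delimiter)
                if len(parts) != 2 or not parts[1].isdigit():
                    continue  # skip headers or malformed lines
                totals[parts[0]] += int(parts[1])
    return totals


def build_demo_archive():
    """Create a tiny in-memory tar.gz standing in for a real download."""
    buffer = io.BytesIO()
    # Hypothetical file names and contents, for illustration only.
    demo_files = [("vol1.tsv", "whale\t3\nsea\t2\n"), ("vol2.tsv", "sea\t5\n")]
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for name, text in demo_files:
            data = text.encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buffer.seek(0)
    return buffer


demo_totals = sum_word_counts(build_demo_archive())
print(demo_totals.most_common())
```

For a downloaded archive, the same function could be called with `open(path, "rb")` in place of the demo buffer. Reading members one at a time keeps memory use modest, which may matter for the larger archives.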
Primary funding for the "Understanding Genre" project was provided by the National Endowment for the Humanities and by the American Council of Learned Societies. Computation was supported by the Institute for Computing in Humanities, Arts, and Social Sciences at the University of Illinois (I-CHASS).
Any views, findings, conclusions, or recommendations expressed in this release do not necessarily represent those of the funding agencies.
Shawn Ballard, Michael L. Black, Jonathan Cheng, Nicole Moore, Clara Mount, and Lea Potter also deserve acknowledgment for their work on the "Understanding Genre" project, particularly in the creation of page-level training data.