...

Texts in the HTDL corpus that are not in the public domain are not available for download, which limits the usefulness of the corpus for research. However, a great deal of fruitful research, especially text mining, can be performed through non-consumptive reading of extracted features (features extracted from the text), even when the full text is not available. To this end, the HathiTrust Research Center (HTRC) has started making available a set of page-level features extracted from the HTDL's public domain volumes. These extracted features can be the basis for certain kinds of algorithmic analysis. For example, topic modeling algorithms work with "bags of words" (sets of tokens), and since tokens and their frequencies are provided as extracted features, the extracted features (EF) dataset enables a user to perform topic modeling with the data.
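As a rough illustration of why page-level token frequencies are enough for this kind of analysis, the sketch below sums per-page token counts into a single volume-level bag of words. The `volume` dictionary, its field names (`pages`, `token_counts`), and the `bag_of_words` helper are hypothetical stand-ins chosen for the example; they are not the actual layout of the EF files, which should be checked against the dataset's own documentation.

```python
from collections import Counter

# Hypothetical, simplified stand-in for one volume's page-level features.
# The field names are illustrative assumptions, not the real EF schema.
volume = {
    "id": "example.vol1",
    "pages": [
        {"seq": 1, "token_counts": {"whale": 3, "sea": 2, "the": 10}},
        {"seq": 2, "token_counts": {"Whale": 1, "ship": 4, "the": 7}},
    ],
}


def bag_of_words(volume, stopwords=frozenset()):
    """Sum per-page token counts into one volume-level bag of words."""
    bag = Counter()
    for page in volume["pages"]:
        for token, count in page["token_counts"].items():
            token = token.lower()  # fold case so "Whale" and "whale" merge
            if token not in stopwords:
                bag[token] += count
    return bag


bag = bag_of_words(volume, stopwords={"the"})
print(bag.most_common())
```

Because only tokens and their counts are used, no word order or running text is needed. A collection of such bags, one per volume or per page, is the kind of input a topic modeling tool typically expects.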

...