Introduction

The Extracted Features (EF) dataset contains informative characteristics of the text, at the page level, from public domain volumes in the HathiTrust Digital Library (HTDL). The dataset covers slightly more than 5 million volumes, representing about 38% of the total digital content of the HTDL.

Rationale

Texts in the HTDL corpus that are not in the public domain are not available for download, which limits the usefulness of the corpus for research. However, a great deal of fruitful research, especially text mining, can be performed through non-consumptive reading of features extracted from the text, even when the full text is not available. To this end, the HathiTrust Research Center (HTRC) has started making available a set of page-level features extracted from the HTDL's public domain volumes. These extracted features can serve as the basis for certain kinds of algorithmic analysis. For example, topic modeling works with bags of words, that is, tokens together with their counts and without regard to word order. Since tokens and their frequencies are provided as features, the EF dataset gives a user what is needed to perform topic modeling.
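As a rough illustration of why token counts are enough for bag-of-words methods, the sketch below sums per-page token counts into a volume-level bag of words and then lays several bags out as a document-term matrix, the usual input to a topic model. The `pages` structure, the function names, and the sample tokens are all hypothetical and chosen for illustration; they do not reflect the actual layout of the EF files.

```python
from collections import Counter

# Hypothetical page-level features: one token -> count mapping per page.
# This shape is an assumption for illustration, not the real EF file format.
pages = [
    {"whale": 3, "sea": 2, "the": 10},
    {"whale": 1, "ship": 4, "the": 7},
]


def volume_bag_of_words(page_counts):
    """Sum per-page token counts into a single bag of words."""
    bag = Counter()
    for counts in page_counts:
        bag.update(counts)  # adds counts token by token
    return bag


def document_term_matrix(bags):
    """Return a sorted vocabulary and one row of counts per bag."""
    vocab = sorted(set().union(*bags))
    matrix = [[bag.get(token, 0) for token in vocab] for bag in bags]
    return vocab, matrix


bag = volume_bag_of_words(pages)
vocab, matrix = document_term_matrix([bag])
print(vocab)
print(matrix)
```

No full text is needed at any point: the word order on each page is never used, only the counts. A matrix of this form could then be handed to a topic modeling library of the user's choice.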

 


 
