The HTRC Extracted Features (EF) dataset contains informative page-level characteristics of the text of public domain volumes in the HathiTrust Digital Library (HTDL). The dataset covers slightly more than 5 million volumes, representing about 38% of the total digital content of the HTDL.
Texts from the HTDL corpus that are not in the public domain are not available for download, which limits the usefulness of the corpus for research. However, a great deal of fruitful research, especially text mining, can be performed through non-consumptive reading of features extracted from the text, even when the full text is not available. To this end, the HathiTrust Research Center (HTRC) has started making available a set of page-level features extracted from the HTDL's public domain volumes. These extracted features can be the basis for certain kinds of algorithmic analysis. For example, topic modeling algorithms work with "bags of words" (unordered collections of tokens and their counts). Because tokens and their frequencies are now provided as extracted features, the EF dataset lets a user perform topic modeling with the data.
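As a rough illustration of the bag-of-words idea, the sketch below collapses POS-tagged token counts into plain token counts. It assumes the counts arrive as a mapping from token to POS tag to count, which is the shape of the tokenPosCount field in the sample record on this page. The function name bag_of_words, the lowercase option, and the sample data are hypothetical and are not part of any HTRC tool.

```python
from collections import Counter


def bag_of_words(token_pos_count, lowercase=True):
    """Collapse {token: {pos_tag: count}} into a Counter of {token: total}."""
    bag = Counter()
    for token, pos_counts in token_pos_count.items():
        key = token.lower() if lowercase else token
        # Sum over POS tags: a bag of words ignores the tag distinction.
        bag[key] += sum(pos_counts.values())
    return bag


# Hypothetical counts in the token -> POS tag -> count shape.
sample = {
    "dramatic": {"JJ": 2},
    "plays": {"NNS": 1, "VBZ": 1},
    "Plays": {"NNS": 1},
}

bag = bag_of_words(sample)
print(bag.most_common())
```

Whether to fold case or keep the POS distinction depends on the analysis. A topic model usually wants one count per word form, so this sketch discards the tags.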
Currently, the extracted features dataset is being provided in connection with worksets. (If you are not familiar with HathiTrust worksets, you may want to review the HTRC Workset Builder tutorial.)
The EF dataset for any HTRC workset can be retrieved as follows.
An EF data file for a volume is in JSON format and consists of volume-level metadata and the extracted feature data for each page in the volume. The volume-level metadata consists of both volume metadata (metadata about the volume) and extracted features metadata (metadata about the extracted features).
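The split between volume metadata, extracted features metadata, and per-page data can be sketched as follows. The record is a trimmed-down, in-memory stand-in that borrows field names from the sample on this page. The helper name split_record is hypothetical, and treating every key under "features" other than "pages" as extracted features metadata is an assumption drawn from that sample, not a documented rule.

```python
import json

# Trimmed-down stand-in for an EF file; the page list is left empty here.
raw = """
{
  "id": "loc.ark:/13960/t1fj34w02",
  "metadata": {
    "schemaVersion": "1.2",
    "title": "Shakespeare's Romeo and Juliet,",
    "pubDate": "1920",
    "language": "eng"
  },
  "features": {
    "schemaVersion": "2.0",
    "dateCreated": "2015-02-20T11:31",
    "pageCount": 230,
    "pages": []
  }
}
"""


def split_record(record):
    """Return (volume metadata, extracted features metadata, list of pages)."""
    volume_meta = record["metadata"]
    features = record["features"]
    # Assumption: everything under "features" except "pages" describes
    # the extraction itself rather than any single page.
    features_meta = {k: v for k, v in features.items() if k != "pages"}
    return volume_meta, features_meta, features["pages"]


record = json.loads(raw)
volume_meta, features_meta, pages = split_record(record)
print(record["id"], volume_meta["pubDate"], features_meta["pageCount"])
```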
The extracted features that HTRC currently provides include part-of-speech (POS) tagged token counts, header and footer identification, and various line-level information. Each page is broken up into three page sections: header, body, and footer.
For each page, the following are provided:
The corresponding fields for any page section (header, body, or footer) have the same names, but are for that specific page section:
Notes:
{
  "id":"loc.ark:/13960/t1fj34w02",
  "metadata":{
    "schemaVersion":"1.2",
    "dateCreated":"2015-02-12T13:30",
    "title":"Shakespeare's Romeo and Juliet,",
    "pubDate":"1920",
    "language":"eng",
    "htBibUrl":"http://catalog.hathitrust.org/api/volumes/full/htid/loc.ark:/13960/t1fj34w02.json",
    "handleUrl":"http://hdl.handle.net/2027/loc.ark:/13960/t1fj34w02",
    "oclc":"",
    "imprint":"Scott Foresman and company, [c1920]"
  },
  "features":{
    "schemaVersion":"2.0",
    "dateCreated":"2015-02-20T11:31",
    "pageCount":230,
    "pages":[
      {
        "seq":"00000015",
        "tokenCount":212,
        "lineCount":38,
        "emptyLineCount":10,
        "sentenceCount":7,
        "languages":[{"en":"1.00"}],
        "header":{
          "tokenCount":7,
          "lineCount":3,
          "emptyLineCount":1,
          "sentenceCount":1,
          "tokenPosCount":{
            "I.":{"NN":1},
            "THE":{"DT":1},
            "INTRODUCTION":{"NN":1},
            "DRAMA":{"NNPS":1},
            "SHAKESPEARE":{"NNP":1},
            "ENGLISH":{"NNP":1},
            "AND":{"CC":1}
          }
        },
        "body":{
          "tokenCount":205,
          "lineCount":35,
          "emptyLineCount":9,
          "sentenceCount":6,
          "tokenPosCount":{
            "striking":{"JJ":1},
            "his":{"PRP$":1},
            "plays":{"NNS":1},
            "London":{"NNP":1},
            "four":{"CD":1},
            ".":{".":7},
            "dramatic":{"JJ":2},
            "1576":{"CD":1},
            "stands":{"VBZ":1},
            ...
            "growth":{"NN":1}
          }
        },
        "footer":{
          "tokenCount":0,
          "lineCount":0,
          "emptyLineCount":0,
          "sentenceCount":0,
          "tokenPosCount":{}
        }
      }
    ]
  }
}
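In the sample record, each page-level count equals the sum of the corresponding header, body, and footer counts (for example, 7 + 205 + 0 = 212 tokens). The sketch below checks that relationship for one page. The count values are copied from the sample with the tokenPosCount maps left out. The names SECTIONS, COUNT_FIELDS, and section_totals are hypothetical, and whether the relationship holds for every page in the dataset is an assumption, not something this page states.

```python
SECTIONS = ("header", "body", "footer")
COUNT_FIELDS = ("tokenCount", "lineCount", "emptyLineCount", "sentenceCount")


def section_totals(page):
    """Sum each count field over the three page sections."""
    return {
        field: sum(page[section][field] for section in SECTIONS)
        for field in COUNT_FIELDS
    }


# Counts copied from the sample page (seq 00000015); token maps omitted.
page = {
    "seq": "00000015",
    "tokenCount": 212, "lineCount": 38,
    "emptyLineCount": 10, "sentenceCount": 7,
    "header": {"tokenCount": 7, "lineCount": 3,
               "emptyLineCount": 1, "sentenceCount": 1},
    "body": {"tokenCount": 205, "lineCount": 35,
             "emptyLineCount": 9, "sentenceCount": 6},
    "footer": {"tokenCount": 0, "lineCount": 0,
               "emptyLineCount": 0, "sentenceCount": 0},
}

totals = section_totals(page)
# Collect any field where the page-level value differs from the section sum.
mismatches = {
    field: (page[field], totals[field])
    for field in COUNT_FIELDS
    if page[field] != totals[field]
}
print(totals)
print(mismatches)
```

A check like this can serve as a quick sanity test after parsing a file. An empty mismatches dictionary means the page-level and section-level counts agree for that page.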