

This page lists examples of what can be done with the Extracted Features dataset. For examples of how scholars are using it, see also Extracted Features in the Wild.


Character Count Examples

Recall from the documentation that one of the page-level features in the advanced extracted-features dataset is beginLineChars (provided per page section, that is, per header, footer and page-body of each page). This feature lists all the characters (ignoring whitespace) that occur at the beginning of a line in that page section, together with their counts (the count for a character being the number of lines in that page section that begin with it).

Below is a typical example taken from an advanced extracted-features data file. Looking at the json data, we can see that, in the page-body section of the page in question, two lines begin with the letter "E", six lines begin with the letter "U", one line begins with the letter "F", and so on:

"body":{"capAlphaSeq":3,
        "beginLineChars":
           {"E":2,"U":6,"F":1,"A":1,"I":2,"\"":7,"L":1,"C":2,"H":1,"W":5,"K":1,"O":1,"D":3,"S":1},
        "endLineChars":
           {"e":1,"n":2,".":6,"\"":6,",":9,";":3,"—":1,"ß":1,"?":1,":":4}}

 


The use cases below illustrate some uses for these features, drawing on the "advanced" features data file (Use Case 1, below) and the "basic" features data file (Use Case 2, below).

Verifying that the more poetry there is in an (English-language) volume, the higher the proportion of capitalized letters. (Or: Coleridge wrote a lot more prose than Keats did!)

English verse traditionally (at least until the 20th century) has the first letter of each line capitalized. That is evidently not the case for English prose!

This leads to the following intuition: the more poetry a volume contains, the higher the proportion of capitalized letters among the characters in its textual content.

We will try to corroborate this intuition in the case of two worksets consisting, respectively, of a single volume of Coleridge’s works and a single book of Keats’s works. We expect the Keats volume to contain mostly poetry, whereas we expect the Coleridge volume to contain both poetry and prose. (Keats mainly wrote poems, while Coleridge wrote both a lot of prose and a lot of poetry!) 

  1. Consider the  compressed ‘advanced’ json file for The Works of Samuel Taylor Coleridge : Prose and Verse, and the compressed ‘advanced’ json file for The Poetical Works of John Keats, respectively.

    The first of these is  a file called wu.89001273754.advanced.json.bz2. It is a compressed json file in the “advanced” format of HTRC EF jsons, corresponding to a public workset that contained just a single volume, an edition of The Works of Samuel Taylor Coleridge: Prose and Verse (from the University of Wisconsin, Madison Library).

    The second of these is a file called uiug.30112065926690.advanced.json.bz2. It is a compressed json file in the “advanced” format of HTRC EF jsons, corresponding to a public workset that contained just a single volume, an edition of The Poetical Works of John Keats (from the University of Illinois, Urbana-Champaign Library).

    Note: In order to obtain the json files corresponding to a specific volume, you would need to create a custom workset consisting of that volume, and then download the EF dataset corresponding to that volume. (For details, see the 'Downloading Extracted Features' guide.) If, for learning purposes, you would rather access these two jsons directly, you can follow the instructions in this "activity worksheet".


  2. Uncompress these two files using bzip2, by running the following commands from the unix command line in the same directory where you have the compressed jsons from Step 1:

    bzip2 -d wu.89001273754.advanced.json.bz2

    and

    bzip2 -d uiug.30112065926690.advanced.json.bz2

    This will create the uncompressed files wu.89001273754.advanced.json (corresponding to the aforementioned volume The Works of Samuel Taylor Coleridge: Prose and Verse) and uiug.30112065926690.advanced.json (corresponding to the aforementioned volume The Poetical Works of John Keats).

    (Why do these files have such long and difficult names? Because the filename is just the volumeID for the corresponding volume from the HathiTrust Digital Library, followed by .advanced.json .)

  3. Obtain the occurrences of ‘S’ (capitalized) in the json file corresponding to  The Works of Samuel Taylor Coleridge : prose and verse. You can accomplish this by running the following unix command at the unix command line:

    cat wu.89001273754.advanced.json | grep -o '\"S\"' | wc -l


    (Why the double quotation marks? Remember what beginLineChars looks like:

    "beginLineChars":
               {"E":2,"U":6,"F":1,"A":1,"I":2,"\"":7,"L":1,"C":2,"H":1,"W":5,"K":1,"O":1,"D":3,"S":1} )

    Note the answer returned by the unix command. Call the result CS (to stand for Coleridge, capitalized 'S').

    Now obtain the occurrences of ‘s’ (small) in the json file corresponding to The Works of Samuel Taylor Coleridge : prose and verse. You can accomplish this by running the following unix command at the unix command line:

    cat wu.89001273754.advanced.json | grep -o '\"s\"' | wc -l

    Note the answer. Call it Cs (for Coleridge, small s). Calculate the proportion, for this book, of ‘S’ (capitalized) to all ‘s’ (capitalized or not). Call this quantity C. Thus, C = CS / (Cs + CS)

  4. Obtain the occurrences of ‘S’ (capitalized) in the json file corresponding to The poetical works of John Keats. You can accomplish this by running the following unix command at the unix command line:

    cat uiug.30112065926690.advanced.json | grep -o '\"S\"' | wc -l

    Note the answer. Call it KS (for Keats, capitalized S).

    Now obtain the occurrences of ‘s’ (small) in the json file corresponding to The poetical works of John Keats.  You can accomplish this by running the following unix command at the unix command line:

    cat uiug.30112065926690.advanced.json | grep -o '\"s\"' | wc -l

  5. Note the answer. Call it Ks (for Keats, small s). Calculate the proportion, for this book, of ‘S’ (capitalized) to all ‘s’ (capitalized or not). Call this quantity K. Thus, K = KS / (Ks + KS)

  6. As the Keats volume consisted almost entirely of poetry, whereas the Coleridge volume contained both prose and verse, we would expect K to be a bit greater than C. Based on the values of K and C that you calculated, check whether this intuition is corroborated!

  7. Notes: 

     

    1. The above example makes a simplifying assumption for illustrative purposes. As the activity described in this worksheet is meant to be carried out within a limited-time workshop setting, we deliberately made the two worksets consist of a single volume each. However, this could introduce sampling error: for example, what if the particular edition of Keats’s collected works that was chosen for analysis contains a lot of textual apparatus in prose? That could make it seem, based on the above kind of analysis, that the proportion of prose in Keats’s work is more than that in Coleridge’s.  To guard against this kind of error, it is always good to take a large enough sample (something which is true of statistical analysis in general). For example, in this particular case, you would ideally want, in a real-life situation, to have several volumes of Coleridge’s collected works in a “Coleridge” workset, and likewise several volumes of Keats’s collected works in a “Keats” workset, and you would compute average (mean) values of C and K, to make sampling error less probable.

    2. If we were doing a rigorous analysis, we wouldn't simply have counted occurrences of 'S' (or 's') anywhere in the jsons, as we just did. Instead, we would have programmatically tallied the occurrences of 'S' (or 's') at the beginning of lines in the page-bodies of the volume, by parsing the content of each beginLineChars and keeping a running (cumulative) count of the frequency of "S", and so on. However, since we are looking at ratios (within books) and comparing them (across books), what we did does provide a rough-and-ready measure, without the somewhat more complex code that a full accounting would require. We now show what that programmatic computation looks like, using Python (a commonly used programming language); a further sketch that combines these per-letter counts into the proportion C (or K) appears at the end of this use case.

    You could programmatically compute the total number of occurrences of "S" in beginLineChars across the page-bodies of all pages of the volume with the HathiTrust volumeID

    uiug.30112065926690

    in the following way:

    python count_letter_begin.py uiug.30112065926690.advanced.json 'S'

    where the file count_letter_begin.py consists of the python program shown in the code block below.

    The code is general-purpose: you can use count_letter_begin.py to count the number of occurrences of any letter within the beginLineChars fields in any "advanced" json that you have downloaded from the extracted features dataset:

    python count_letter_begin.py <any advanced json downloaded from EF data> '<letter whose total number of occurrences within the beginLineChars fields is sought to be counted within that volume>'

     

    # count_letter_begin.py
    # count the number of occurrences of any letter within
    # the beginLineChars fields in any "advanced" json that
    # you have downloaded from the extracted features dataset
    #
    # usage: python count_letter_begin.py <advanced json file> '<letter>'

    import sys
    import json
    from pprint import pprint

    # load the advanced EF json for the volume
    with open(sys.argv[1]) as data_file:
        data = json.load(data_file)

    letter = sys.argv[2]

    # add up the counts for the letter across the page-body
    # beginLineChars of every page in the volume
    count = 0
    for page in data['features']['pages']:
        begin_line_chars = page['body']['beginLineChars']
        if letter in begin_line_chars:
            count = count + begin_line_chars[letter]
    pprint(count)

    Explanation of how the above python code works:

    data = json.load(data_file)
    pprint(data)


    will generate an output that looks like this:

    {u'features': {u'dateCreated': u'2015-02-19T04:15',
     u'pageCount': 272,
     u'pages': [{u'body': {u'beginLineChars': {},
       u'capAlphaSeq': 0,
       u'endLineChars': {}},
       u'footer': {u'beginLineChars': {},
       u'endLineChars': {}},
       u'header': {u'beginLineChars': {},
       u'endLineChars': {}},
       u'seq': u'00000001'},
             {u'body': {u'beginLineChars': {u'_': 1, u'r': 1},
       u'capAlphaSeq': 0,
       u'endLineChars': {u'L': 1, u'_': 1}},
       u'footer': {u'beginLineChars': {},
       u'endLineChars': {}},
       u'header': {u'beginLineChars': {},
       u'endLineChars': {}},
       u'seq': u'00000002'},
            {u'body': {u'beginLineChars': {u',': 1,
       u'5': 1,
       u'E': 1,
       u'I': 1,
       u'\\': 2,
       u'n': 1,
       u'w': 1},
       u'capAlphaSeq': 2,
       u'endLineChars': {u'.': 4,
       u';': 1,
       u'L': 1,
       u'P': 1,
       u'\u2018': 1}},
          etc

     

    You then have to add up the counts of "S" from each page-body's beginLineChars in the above output (this is what the loop in the program does). To inspect the contents of an individual page-body in python, you can do the following:

    pprint(data['features']['pages'][n-1]['body'])

          will give you the contents of beginLineChars, capAlphaSeq and endLineChars of the nth page's page body.

          For example, for the file uiug.30112065926690.advanced.json

          pprint(data['features']['pages'][23]['body'])

         will give you the contents of beginLineChars, capAlphaSeq and endLineChars of the 24th page's page body, as follows:

{u'beginLineChars': {u'A': 6,
                     u'B': 3,
                     u'C': 1,
                     etc.
                     u'\u2019': 1},
 u'capAlphaSeq': 5,
 u'endLineChars': {u',': 13,
                   u'.': 3,
                   u':': 1,
                   u';': 3,
                   u'd': 2,
                   u'e': 6,
                   u'g': 2,
                   u'h': 2,
                   u'l': 1,
                   u'p': 1,
                   u'r': 2,
                   u't': 3}}
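
Putting the pieces together: the following is a minimal sketch of how the proportion C (or K) defined above could be computed in one pass from an advanced json. The script name proportion_capital_s.py is an illustrative choice of ours; the traversal mirrors count_letter_begin.py above, so it implements the fuller accounting described in note 2 rather than the rough grep-based count.

# proportion_capital_s.py  (illustrative sketch)
# compute count('S') / (count('S') + count('s')) over the page-body
# beginLineChars of an "advanced" EF json -- i.e. the quantity C or K.
#
# usage: python proportion_capital_s.py <advanced json file>

from __future__ import print_function, division
import sys
import json

with open(sys.argv[1]) as data_file:
    data = json.load(data_file)

upper_count = 0   # lines beginning with 'S'
lower_count = 0   # lines beginning with 's'
for page in data['features']['pages']:
    begin_line_chars = page['body']['beginLineChars']
    upper_count += begin_line_chars.get('S', 0)
    lower_count += begin_line_chars.get('s', 0)

if upper_count + lower_count > 0:
    print(upper_count / (upper_count + lower_count))
else:
    print("no lines beginning with 'S' or 's' found")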

Discriminating poetry from prose

Suppose that you are looking for the book of poems Buch der Lieder (literal meaning: ‘Book of songs’), written in 1827  by the German poet Heinrich Heine. It is one of the most popular books of poetry ever in German literature and, as a result, it has been the subject of much critical commentary and secondary literature. 

One common problem, when you are trying to find a copy of a book such as this in the HathiTrust Digital Library (HTDL), is that the bibliographic metadata may not always be able to help you fully. For example, it is entirely possible that a book titled Buch der Lieder may actually be a commentary on, or a critical companion to, the actual book of poems (with that name) by Heine, which you are looking for.

You could, of course, look at the book using the “Full view” feature of the HathiTrust catalog to ascertain if the pages are in English or in German. However, if you are creating worksets on which to perform text analytics at scale (for example, if you are creating a workset consisting of volumes of poetry),  this sort of issue can become a problem, because you may not have the time to check each book with the same name individually to see if it is the book of poetry or merely commentary on the book. The beginLineChars feature provided as part of the ‘advanced’ feature files can, again, be of considerable help to you in this situation, in trying to discriminate poetry from prose.

How? Recall that for a lot of poetry (at least, for poetry in German or English in the nineteenth century anyway), the beginning of each line conventionally begins with a capitalized letter. This means that you will be able to tell which volumes in your workset are  truly volumes of poetry, by downloading the advanced features data and programmatically checking the beginLineChars feature for the page-body section of pages in each volume. If very few of the characters there happen to be small (non-capitalized)  letters, then the prediction is that the volume is a volume of poetry. For example, in the example of beginLineChars that we gave earlier, which was taken from a page of Heine’s Buch der Lieder  (occurring with volumeID mdp.39015012864743 in the HTDL collection), namely:


"beginLineChars":
 {"E":2,"U":6,"F":1,"A":1,"I":2,"\"":7,"L":1,"C":2,"H":1,"W":5,"K":1,"O":1,"D":3,"S":1},
 

we can conclude that the indication is strong that this particular page-section contains only poetry (as all the beginning-of-line characters are capitalized letters, which would be very unlikely for prose).
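
Below is a minimal sketch of what such a programmatic check might look like for a whole volume; the script name is_poetry.py and the 0.9 threshold are arbitrary illustrative choices, not part of the EF documentation:

# is_poetry.py  (illustrative sketch)
# estimate whether a volume is mostly poetry by measuring what fraction
# of the alphabetic begin-line characters in its page bodies are
# capitalized, using the beginLineChars feature of an "advanced" EF json.
#
# usage: python is_poetry.py <advanced json file>

from __future__ import print_function, division
import sys
import json

with open(sys.argv[1]) as data_file:
    data = json.load(data_file)

upper = 0
lower = 0
for page in data['features']['pages']:
    for char, count in page['body']['beginLineChars'].items():
        if char.isalpha():
            if char.isupper():
                upper += count
            else:
                lower += count

if upper + lower == 0:
    print("no alphabetic begin-line characters found")
else:
    fraction = upper / (upper + lower)
    print("fraction of capitalized begin-line letters: {:.3f}".format(fraction))
    # 0.9 is an arbitrary cut-off, chosen only for illustration
    print("looks like poetry" if fraction > 0.9 else "looks like prose (or mixed)")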


Notes/caveats:

  1. Please remember that “lines” and “sentences” are different things, as far as the extracted features dataset is concerned.

 

Using Token Counts

Predicting readability for books

We will now consider another situation where the bibliographic metadata may not be sufficient. If you are interested only in those books in your workset that are above (or below) a certain grade-level in terms of readability, then it would be useful to have the grade-level readability information about each book included as part of the metadata for each book. However, this information is usually not a part of bibliographic metadata. Nevertheless, based on the features that are provided in the basic features files, it will be possible to predict the grade-level score for each book in the workset using  the commonly used Dale-Chall formula for grade-level-specific readability as shown below. [Dale, E. and J. S. Chall. "A Formula for Predicting Readability". Educational Research Bulletin. Jan 21 and Feb 17, 1948,  Vol. 27 : pp. 1–20, 37–54.]

The Dale-Chall formula predicts grade-level-specific readability on the basis of the following steps:

  1. Calculate the average sentence length (in words) for the book. (To do this using the provided features, ignore the header and footer page sections; compute the average from the number of words in each page-body section and the number of sentences in that section.)

  2. Calculate P,  the percentage of words in the book that are not included in the Dale–Chall word list of 3,000 easy words. (The list pointed to here actually contains 2950 words; it seems that people speak of the round number "3000" when they speak of the Dale-Chall list, but all the actual examples I have seen have contained between 2940 and 2950 words.)

  3. Calculate the raw score for readability according to the formula

                           Raw Score = 0.1579*P + 0.0496*(average sentence length)

     If the percentage of difficult words is above 5%, then add 3.6365 to the raw score to get the adjusted score; otherwise the adjusted score is equal to the raw score. (A worked example follows the grade-level guidelines below.)

  4. Calculate the grade level from the adjusted score according to these guidelines:

     4.9 and below — Grade 4 and below

     5.0 to 5.9 — Grades 5–6

     6.0 to 6.9 — Grades 7–8

     7.0 to 7.9 — Grades 9–10

     8.0 to 8.9 — Grades 11–12

     9.0 to 9.9 — Grades 13–15 (college)

     10 and above — Grades 16 and above
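
For example (with hypothetical numbers): suppose P = 12 and the average sentence length is 20 words. Then the raw score is 0.1579*12 + 0.0496*20 = 1.89 + 0.99 = 2.89 (rounded); since 12% is above 5%, the adjusted score is 2.89 + 3.6365 ≈ 6.52, which falls in the 6.0 to 6.9 band, i.e. Grades 7–8.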


The number of tokens in each page section is available through the tokenCount feature in the basic features file, and the number of sentences in each page section is available through the sentenceCount feature in the basic features file. This will allow you to carry out the computation in Step 1. For the computation in Step 2, you can make use of the tokenPosCount feature, which provides part-of-speech-tagged tokens and their corresponding frequency counts.

Notes/caveats: 

1. Since the Dale-Chall formula is a heuristic formula, you will probably find it sufficient to take random samples of pages for the purpose of computation, rather than carry out the computations for every single page. (The program below doesn't use sampling, however.)

2. An incomplete sentence at the beginning of a page (because the sentence began on the previous page) or at the end of a page (because the sentence carries over to the subsequent page) still increments the sentenceCount for that page by one (1). For this reason, aggregating values of sentenceCount for a sequence of pages may not give the correct total number of sentences for that sequence, and some kind of heuristic adjustment could be necessary.

The python code for the implementation is below. (Note: This code assumes that you have the list of 3000-odd Dale-Chall "easy" words stored in a text file called 'DaleChallEasyWordList.txt' in the same directory in which you run this code.)

#!/usr/bin/env python
#

# compute_readability_score
# created by Sayan Bhattacharyya, June 2016

# prints out the appropriate grade (determined by
# reading level) for a volume, as computed on the
# basis of the Dale-Chall formula for grade-level-specific
# readability.

#
# Usage example:
#    readability.py  <basic_json_filename>

"""Based on the features that are provided in the basic Extrqacted-Features
features files, it is possible to predict the grade-level readability score
for a volume using  the commonly used Dale-Chall formula for grade-level-specific
readability as shown below.
[Dale, E. and J. S. Chall. "A Formula for Predicting Readability".
Educational Research Bulletin. Jan 21 and Feb 17, 1948,
Vol. 27 : pp. 1–20, 37–54.]

The Dale-Chall formula predicts grade-level-specific
readability on the basis of the following steps:

1. Calculate the average sentence length (in words) for the book.
(To do this using the provided features, we ignore the
header and footer page sections. Compute the average on the
basis of the information you have about number of words in each
page-body page section and the number of sentences in that
page section.)

2. Calculate P,  the percentage of unique tokens
in the book that are not included in the Dale–Chall
word list of 3,000 easy words. (We use the list at
http://countwordsworth.com/blog/dale-chall-easy-word-list-text-file/ ,
which actually contains 2950 words; it seems
that people speak of the round number "3000" when they speak
of the Dale-Chall list, but all the actual examples I have
seen have contained between 2940 and 2950 words.)

3. Calculate the raw score for readability according to the formula:

Raw Score = 0.1579*P+ 0.0496*(average sentence length)

If the percentage of difficult words is above 5%, then add 3.6365 to
the raw score to get the adjusted score, otherwise the adjusted score
is equal to the raw score.

4. Calculate the grade level from the adjusted score according to
these guidelines:

                     4.9 and below —  Grade 4 and below
                     5.0 to 5.9 —  Grades 5–6
                     6.0 to 6.9 — Grades 7–8
                     7.0 to 7.9 — Grades 9–10
                     8.0 to 8.9 — Grades 11–12
                     9.0 to 9.9 — Grades 13–15 (college)
                    10 and above — Grades 16 and above

Note: The number of tokens in each page
section is available through the tokenCount feature in the
basic features file, and  the number of sentences in each
page section is available through the sentenceCount feature
in the basic features file. This allows us to carry out
the computation in Step 1. For carrying out the computation
in Step 2, we make use of the tokenPosCount
feature, which provides part-of-speech-tagged tokens and
their corresponding frequency counts.

Caveats:

1. Since the Dale-Chall formula is a heuristic formula,
it is reasonable to assume that it is sufficient to
take random samples of pages for the purpose of
computation, rather than carry out the computations for
every single page. We haven't done this in this script,
however, but if the workset is too large, you may want to
take this approach.

2. Incomplete sentences both at the beginning of a page (because the
sentence is being carried over after having begun in a previous page)
or at the end of a page (because the sentence is being carried over
to the subsequent page) increment the sentenceCount for the page by one
(1). For this reason, ideally, aggregating values of sentencecount
for sequences of pages may not provide the correct number of
total sentences for the sequence of pages, and some kind of
heuristic adjustment could be necessary. However, for
simplicity, we will not be doing that.
"""

from __future__ import print_function, division
import collections
import json
import sys
from io import open

if __name__ == '__main__':
    if len(sys.argv) < 2:
        print("Missing filename argument")
        sys.exit(1)

    filename = sys.argv[1]
    with open(filename, 'r', encoding='utf8') as input:
        data = json.load(input)

    pages = data['features']['pages']
    word_counts = 0
    sentence_counts = 0
    token_counts = collections.Counter()

    # read the Dale Chall easy-word list into a set, programmatically;
    # this assumes that you have the Dale Chall easy-word list in
    # your current directory as the text file 'DaleChallEasyWordList.txt'.
    with open('DaleChallEasyWordList.txt', 'r', encoding='utf8') as wordlist_file:
        easyWordSet = set(line.strip() for line in wordlist_file)

    # compute the average sentence length
    for page in pages:
        word_counts += page['body']['tokenCount']
        sentence_counts += page['body']['sentenceCount']
        token_pos_counts = page['body']['tokenPosCount']
        for token, pos in token_pos_counts.items():
            token_counts[token] += sum(pos.values())
    average_sentence_length = word_counts / sentence_counts

    # calculate p, the percentage of unique tokens in
    # the book not included in the Dale-Chall easy word list
    tokensNotInEasyWordSet = 0
    numberOfUniqueTokens = 0
    for token, count in token_counts.items():
         numberOfUniqueTokens += 1
         if token not in easyWordSet:
             tokensNotInEasyWordSet += 1
    p = tokensNotInEasyWordSet*100 / numberOfUniqueTokens

    # calculate the raw score
    rawScore = 0.1579*p + 0.0496*average_sentence_length

    # calculate the adjusted score:
    # if the percentage of difficult words is
    # above 5%, then add 3.6365 to the raw score
    # to get the adjusted score, otherwise the adjusted
    # score is equal to the raw score.
    adjustedScore = rawScore + 3.6365 if p > 5 else rawScore

    # Now compute the grade-level based on the adjusted score.
    # We implement the required switch-case logic in a
    # "pythonic" way, with a dictionary (i.e. an associative array)
    # as much as possible, rather than with too many
    # if-elif statements.
    choice = -1
    if (adjustedScore - 5) < 0:
       choice = 0
    elif (10 - adjustedScore) <= 0:
       choice = 6
    else:
       choice = int(adjustedScore - 4)
    if choice < 0:
       print("Choice cannot be negative. Something is wrong.\n")
       sys.exit(1)
    def print_grade_range(num):
       print(u"Grades {}-{}\n".format(5 + (num-1)*2, 6 + (num-1)*2))
    choices = { 0 : lambda: print("Grade 4 and below\n"),
                5 : lambda: print("Grades 13-15\n"),
                6 : lambda: print("Grades 16 and above\n"), }

    # Print out the stats, followed by the grade level of readability at the very end
    print(u"Name of json = {}\n".format(filename))
    print(u"tokensNotInEasyWordSet = {}\tnumberOfUniqueTokens = {}\taverage_sentence_length = {}\n".format(tokensNotInEasyWordSet,numberOfUniqueTokens,average_sentence_length))
    print(u"Name of json = {}\tAdjusted Score = {}\tChoice  = {}\n".format(filename, adjustedScore, choice))
    print(u" ".format())
    print(u"Computed grade level of readability = ")
    if choice >= 1 and choice <= 4:
         print_grade_range(choice)
    else:
         choices[choice]()


 

Testing a hypothesis about the incidence of a word’s occurrence (Or, Little Dorrit by Dickens mentions "prison" a lot more than his Bleak House does!)

Words related to prison and incarceration tend to be relatively more salient in Little Dorrit than in Bleak House. The plot summaries of these two novels by Charles Dickens (summarized/adapted from the Dickens Fellowship website) give us an idea why:

Bleak House: A prolonged law case concerning the distribution of an estate, which brings misery and ruin to the suitors but great profit to the lawyers, is the foundation for this story. Bleak House is the home of John Jarndyce, principal member of the family involved in the law case.

Little Dorrit: Here Dickens plays on the theme of imprisonment, drawing on his own experience as a boy of visiting his father in a debtors' prison. William Dorrit is locked up for years in that prison, attended daily by his daughter, Little Dorrit. Her unappreciated self-sacrifice comes to the attention of Arthur Clennam, recently returned from China, who helps bring about her father's release but is himself incarcerated for a time. 


So, a reasonable hypothesis is that the novel Little Dorrit will tend to contain more occurrences of the word “prison” than the novel Bleak House does. We will verify this hypothesis using the HTRC Extracted Feature functionality.

  1. Consider the compressed ‘basic’ json file for Little Dorrit and the compressed ‘basic’ json file for Bleak House, respectively. The first of these is a file called hvd.32044025670571.basic.json.bz2 . It is a compressed json file in the “basic” format of  HTRC EF jsons, corresponding to a public workset called littledorrit1 — a workset that contained just a single volume, an edition of Little Dorrit. The second of these is a file called miun.aca8482,0001,001.basic.json.bz2 . It is a compressed json file in the “basic” format of HTRC EF jsons, corresponding to a public workset called bleakhouse1 — a workset that contained just a single volume, an edition of Bleak House.
     
  2. Uncompress these files by running the following commands from the unix command line: 

    bzip2 -d hvd.32044025670571.basic.json.bz2

    and 

    bzip2 -d miun.aca8482,0001,001.basic.json.bz2

    which  will create the uncompressed files hvd.32044025670571.basic.json (corresponding to Little Dorrit) and  miun.aca8482,0001,001.basic.json  (corresponding to Bleak House) . 

    (Why do these files have such long and difficult names? Because the filename is just the volumeID for the corresponding volume from the HathiTrust Digital Library, followed by .basic.json.)

  3. Count the number of pages in this edition of Little Dorrit in which the word ‘prison’ occurs at least once. You can accomplish this by running the following unix command at the unix command line:

    grep -o prison hvd.32044025670571.basic.json | wc -l

    and you will find that the answer returned is 173. 

    Now count the number of pages in this edition of Bleak House in which the word ‘prison’ occurs at least once. You can accomplish this by running the following unix command at the unix command line:

    grep -o prison miun.aca8482,0001,001.basic.json | wc -l

    and you  will find that the answer returned is 29.

  4. As 173 is much greater than 29, we have just verified that the word "prison" tends to occur much more frequently in Little Dorrit than in Bleak House.

    Thus, our hypothesis was verified.

    Notice that this result is also strongly suggested by the Dunning log-likelihood statistic, which provides a measure of the relative salience of a word in one corpus with respect to another corpus. We can verify this by running Dunning’s log-likelihood algorithm, which is provided at the HathiTrust portal, on this pair of worksets (a workset containing an instance of Little Dorrit with respect to a workset containing an instance of Bleak House); see the section 'Case study: A “compare and contrast” exercise at the HTRC portal using the Dunning log-likelihood algorithm' in this poster ('Workset Builder and Portal of the HathiTrust Research Center', HathiTrust Research Center UnCamp, Ann Arbor, Michigan, 30-31 March 2015).

  5. Notes:

    1. The perceptive reader will note that a simplifying assumption was made: we simply counted the occurrences of the word "prison" within each json. This gives us a rough estimate, but for a precise, accurate computation you should actually parse the jsons and keep a running total of the occurrences of "prison" on each page, for all the pages. You can write a simple program in a programming language (e.g. python) to carry out such a computation; the spirit of the computation is similar to the one described in Use Case 1, above, on this page. The following python code (let's call this program count_word.py) accomplishes this, printing out the precise number of occurrences of the exact word 'prison' throughout a volume:

      #!/usr/bin/env python

      # count_word.py
      # created by Sayan Bhattacharyya, September 2015
      # count the number of occurrences of any word within
      # a volume, based on the "basic" json file for that volume that
      # you have downloaded from the extracted features dataset
      #
      # usage: python count_word.py <basic json file> '<word>'

      import sys
      import json
      from pprint import pprint

      # load the basic EF json for the volume
      with open(sys.argv[1]) as data_file:
          data = json.load(data_file)

      word = sys.argv[2]

      # for every page, if the word appears in the page-body's tokenPosCount,
      # add up its counts across all of its parts of speech
      count = 0
      for page in data['features']['pages']:
          token_pos_counts = page['body']['tokenPosCount']
          if word in token_pos_counts:
              count = count + sum(token_pos_counts[word].values())
      pprint(count)


      Thus,

      count_word.py hvd.32044025670571.basic.json 'prison'

      will print out the precise number of occurrences of the exact word 'prison' throughout the volume we selected for Little Dorrit.

      The code is general-purpose: you can use count_word.py to count the number of occurrences of any word in any "basic" json that you have downloaded from the extracted features dataset:
       count_word.py  <any basic json downloaded from EF data> '<word whose total number of occurrences is sought to be counted within that volume>'


    2. Another simplification was that we counted only occurrences of the exact word "prison". For a large enough sample, this ought not to matter much. However, as an exercise, you may want to redo the activity by considering any expression that contains the string "prison" (e.g. "imprisonment", "prisoner", "prisons", "prisoners", etc.). You will also want to ignore case.

      (Hint: If you are using the "rough" computation using grep, type man grep at the unix command line to see what other options there are for using grep, and see if you can make grep do this job for you; if you are using the "precise" computation using your own python program as above, think about how you can make use of regular expression functionalities to modify the python program provided above, to accomplish this.)
       

Identifying a volume in a workset in which a specified word occurs the most times (Or, which of the English romantic poets was the greatest dreamer among them all?)

Which among the four most important English Romantic poets (Wordsworth, Coleridge, Shelley and Keats) uses the word "dream" most in their writing? My bet is on Coleridge, notorious for his opium addiction as well as for certain poems having come to him in his dreams! That makes me curious: which of these four actually mentions the word "dream" the most in their writing? Was it Coleridge?

Here is a little python program to test out these kinds of guesses. The python script find_json_with_maximum_occurrences_of_word.py, below, prints out the name of the basic json in the current directory with the most occurrences of a specified word. Notice that it reuses the counting logic of count_word.py (above) as a function.

Of course, to be more accurate you should compare not the raw number of occurrences of the specified word, but the ratio of the number of occurrences of the specified word to the total number of words in the volume. 

Suppose that you have the basic jsons corresponding to four volumes, namely the collected works of Coleridge, Wordsworth, Shelley and Keats respectively, sitting in your current directory. Then you will execute the following command to satisfy your curiosity about the above question:

find_json_with_maximum_occurrences_of_word.py <word>

e.g.:
 

find_json_with_maximum_occurrences_of_word.py 'dream'

 

#!/usr/bin/env python

# find_json_with_maximum_occurrences_of_word.py
# created by Sayan Bhattacharyya, September 2015
# prints out the name of that basic json in the current directory 
# with the most occurrences of a specified word. 
#
# (Of course, to be more accurate you ought to compare 
# not the raw number of occurrences of the specified word, 
# but the ratio of the number of occurrences of the 
# specified word to the total number of words in the volume!
# We are not doing that in this, simpler, program.) 
#
# Usage:
# Suppose that you have the basic jsons corresponding to four 
# volumes, namely the collected works of Coleridge, Wordsworth, 
# Shelley and Keats respectively, sitting in your current directory. 
# Then you will execute the following command
# to satisfy your curiosity about the above question:
# find_json_with_maximum_occurrences_of_word.py <word>
# e.g.:
# find_json_with_maximum_occurrences_of_word.py 'dream'




import sys
import json
import glob
from pprint import pprint

def count_word(data):
    # count occurrences of the word (taken from the command line, below)
    # across the page-body tokenPosCount fields of the volume
    count = 0
    for page in data['features']['pages']:
        token_pos_counts = page['body']['tokenPosCount']
        if word in token_pos_counts:
            count = count + sum(token_pos_counts[word].values())
    return count

word=sys.argv[1]

# assuming data files are in the current dir
path = './*.basic.json'

files = glob.glob(path)
max_count = 0
file_with_max_count = ''

for filename in files:
    with open(filename, 'r') as data_file:
        curr_data = json.load(data_file)
        curr_count = count_word(curr_data)
        if curr_count > max_count:
            max_count = curr_count
            file_with_max_count = filename
pprint(file_with_max_count)

Co-occurrence and correlations (Or, is fear associated more with the poor, or more with prisons?)

Recently, researchers at Stanford University carried out a study in which they mapped London's "emotional geography" by categorizing "what feelings or sensations common settings convey in the novels of Dickens, Thackeray, Austen and 738 other mostly 19th-century authors". They found an interesting result:


 There were some surprises in the London research, said Ryan Heuser, an associate research director for the Stanford Literary Lab. “We thought fear would be linked to poverty,” Mr. Heuser said. But poorer sections of the city did not fare particularly badly in the literature. “Fear was more associated with ancient markets and prisons,” he added.

You can carry out a rudimentary version of this kind of analysis, testing whether this finding holds for your own volumes of interest, using the EF dataset. For a suitable workset (for example, the novels by nineteenth-century authors that the Stanford researchers studied), you could compare the total number of co-occurrences of the words "fear" or "afraid" with "poor" or "poverty", with the total number of co-occurrences of the words "fear" or "afraid" with "prison" or "jail". (A minimal sketch of such a page-level co-occurrence count is given after the notes below.)

Notes:

1. Keep in mind, though, that the finest "grain size" for co-occurrence in the EF dataset is the page, as no positional information is available in the dataset. Still, you can get potentially interesting information by tracking in-page co-occurrences.

2. For a better comparison, you would  probably want to compare not the raw numbers of co-occurrences, but the normalized numbers. That is, you would want to compare something like:  

(the total number of co-occurrences of the words "fear" or "afraid" with "poor" or "poverty") / (the total number of occurrences of the words "poor" or "poverty")

with:  

(the total number of co-occurrences of the words "fear" or "afraid" with "jail" or "prison") / (the total number of occurrences of the words "jail" or "prison").
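
Here is a minimal sketch of how such a page-level co-occurrence count could be computed from a volume's "basic" EF json. The word groups, the script name count_cooccurrences.py, and the choice to count co-occurring pages (rather than raw token counts) are illustrative assumptions, so the normalization shown is a page-level simplification of the one described in note 2 above.

# count_cooccurrences.py  (illustrative sketch)
# count, for one volume's "basic" EF json, the number of pages on which
# a word from group A ("fear"/"afraid") co-occurs with a word from
# group B ("poor"/"poverty"), along with the number of pages on which
# a group-B word occurs at all (used for normalization).
#
# usage: python count_cooccurrences.py <basic json file>

from __future__ import print_function, division
import sys
import json

GROUP_A = {"fear", "afraid"}
GROUP_B = {"poor", "poverty"}      # swap in {"prison", "jail"} to compare

with open(sys.argv[1]) as data_file:
    data = json.load(data_file)

cooccurring_pages = 0
group_b_pages = 0
for page in data['features']['pages']:
    # lowercase the page's token types so that case is ignored
    tokens = set(t.lower() for t in page['body']['tokenPosCount'])
    has_a = bool(GROUP_A & tokens)
    has_b = bool(GROUP_B & tokens)
    if has_b:
        group_b_pages += 1
    if has_a and has_b:
        cooccurring_pages += 1

print("pages with both groups: {}".format(cooccurring_pages))
print("pages with a group-B word: {}".format(group_b_pages))
if group_b_pages:
    print("normalized co-occurrence: {:.4f}".format(cooccurring_pages / group_b_pages))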

Topic modeling with EF data

The EF dataset's per-page bags-of-words can be fed into a topic modeling algorithm. An example of this is in the following conference paper:
  •      Peter Organisciak, Loretta Auvil, J. Stephen Downie. 2015. “Remembering books: A within-book topic mapping technique.” Digital Humanities 2015. Sydney, Australia. (slides coming soon).

The specific scenario in this paper pertains to "within-book" topic modeling; that is, the text data for the purposes of the paper consisted of the EF data from single books. The approach would be the same in the general case, in which topic modeling is done on a large corpus at scale. Two common approaches are covered below: LDA-based topic modeling using Mallet, and non-negative matrix factorization (NMF).


LDA-based topic modeling using Mallet
Mallet provides an LDA-based topic modeling functionality. See:

The DARIAH-DE initiative, the German branch of DARIAH-EU (the European Digital Research Infrastructure for the Arts and Humanities consortium), has a useful page describing how to do topic modeling using Mallet. Notice, however, that Mallet expects its input files to be text (.txt) files: in the example on that page, the novels (by Jane Austen and Charlotte Brontë) on which topic modeling is performed have been split up into smaller sections called

'Austen_Emma0000.txt', 'Austen_Emma0001.txt', 'Austen_Emma0002.txt', etc.

The extracted features data, of course, is available as bags-of-words, rather than as contiguous texts. So, in order to use Mallet, you should convert the unigram counts into text files. These text files will not contain meaningful "human-readable" text — only unigram instances (using frequency information).
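
Below is a minimal sketch of one way such a conversion could be done. The output-naming scheme, the PAGES_PER_DOC grouping parameter, and the script name ef_to_mallet_txt.py are illustrative assumptions; the grouping parameter also illustrates the "grain size" flexibility discussed in the notes further below.

# ef_to_mallet_txt.py  (illustrative sketch)
# convert the page-level unigram counts in a "basic" EF json into
# plain .txt pseudo-documents that Mallet can ingest.  Each output
# file simply repeats every token as many times as it occurs; the
# result is not human-readable prose, but the word frequencies are
# preserved, which is all that bag-of-words topic modeling needs.
#
# usage: python ef_to_mallet_txt.py <basic json file>

import sys
import json
from io import open

PAGES_PER_DOC = 20      # arbitrary grouping; 1 means one .txt file per page

with open(sys.argv[1], 'r', encoding='utf8') as data_file:
    data = json.load(data_file)

volume_id = sys.argv[1].replace('.basic.json', '')
pages = data['features']['pages']

for start in range(0, len(pages), PAGES_PER_DOC):
    chunk = pages[start:start + PAGES_PER_DOC]
    words = []
    for page in chunk:
        for token, pos_counts in page['body']['tokenPosCount'].items():
            words.extend([token] * sum(pos_counts.values()))
    out_name = "{}_{:04d}.txt".format(volume_id, start // PAGES_PER_DOC)
    with open(out_name, 'w', encoding='utf8') as out:
        out.write(u" ".join(words))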
 

Topic modeling using Non-negative matrix factorization (NMF)

An alternative to LDA-based topic modeling is a technique called non-negative matrix factorization (NMF). Here are instructions from the DARIAH-DE initiative as to how to perform this type of topic modeling using python.
 

Notes:

  1. Using EF data gives the user more flexibility (that is, it allows for the tuning of an additional "knob", so to speak) when doing topic modeling; this is one reason why a user may find it worth using EF data to topic-model a workset even though the HTRC Portal algorithm for topic modeling already provides an off-the-shelf way of doing so. When doing topic modeling with an external topic modeler (such as Mallet) with EF data, the user can decide what grain size should count as a "document". For the topic modeling algorithm provided in the HTRC Portal, a page implicitly counts as a "document" for the purposes of the algorithm, and the user cannot change that "grain size". With an external topic modeler and EF data, however, one has control over what "grain size" constitutes a "document": although the EF data is organized by pages, one can programmatically reorganize it into data with a larger grain size if needed (though, of course, not into smaller grain sizes). So, for example, using cues (features) such as blank lines at the ends of chapters as heuristics, one could (not with 100% accuracy, of course) split up a corpus of, say, novels into individual chapters rather than pages, by programmatically combining the corresponding page-level bags-of-words, and then treat each chapter as a "document" for the purposes of the external topic modeling algorithm. For some purposes a book chapter may well be a more coherent and logical "grain size" than a page, which is an arbitrary unit determined ultimately by physical considerations of the codex form rather than by logical differentiation of the content (in the way a chapter is).

  2. As Megan Brett points out in her 2012 JDH paper, for a topic modeling algorithm to produce meaningful results the number of "documents" should be on the order of several hundred. (Please refer to the slide deck for the 2015 Advanced HTRC Workshop at HILT for a discussion of this, and of what a "document" means in this context.) So, when you prepare or reorganize EF data to feed into an external topic modeling algorithm, always make sure that the number of "documents" is on the order of several hundred.

  3. For a concise description of what topic modeling consists of, see the slides titled 'Topic Modeling (TM) algorithm: What is happening "under the hood"?' in the slide deck available on the 'HTRC Publications, Presentations' page.

How to use Ted Underwood's genre-classified dataset


Using extracted unigrams, Ted Underwood has created page-level maps of genre for English-language monographs in the HathiTrust Digital Library for the period 1700-1922, using probabilistic machine learning algorithms. The goal is to show how literary scholars can use machine learning for selecting genre-specific collections from digital libraries. The genres that have been extracted can roughly be described as "Imaginative literature": prose fiction, drama and poetry (assumed to be non-overlapping, discrete categories). Non-fiction prose and paratext have been identified and put into their own categories. He has made the datasets available on the web. These datasets provide one way to use extracted unigrams in your research; they are useful in situations where you want your data pre-classified by genre. We will show an example of how you can use this dataset to confirm or disconfirm a hypothesis.

In a recent blog post, Chris Forster has (among other things) used Ted's dataset to plot normalized frequencies, by year, for the word 'america' in English-language fiction in the HathiTrust Digital Library for the period 1700-1922. As he mentions, the plot seems "reasonablish", showing that mentions of "america" remained fairly constant until the 1830s or so, and then started rising slowly and steadily.

These kinds of trend plots can be particularly interesting when applied to pairs of words, to see how they fared relative to each other. Consider the words 'woman' and 'lady'. Here is a hypothesis about them: with the rise of greater freedom and independence for women, as well as increasing democratization and the weakening of class hierarchies, we should expect to see some decline in normalized counts of "lady" and some uptick in normalized counts of "woman" in prose fiction over time. I wanted to check whether this actually happens by plotting the normalized frequencies, by year, for the words 'lady' and 'woman' in English-language fiction in the HathiTrust Digital Library for the period 1700-1922. I used Chris Forster's R script from his blog post, above, and generated the plots for 'lady' and 'woman' for the genre-classified English-language fiction dataset.

The plots look like this.

As you can see, normalized occurrences of "lady" follow an inverted-'v'-shaped curve, increasing slowly and steadily until about 1810 and decreasing slowly and steadily thereafter. Normalized occurrences of 'woman' remain more or less constant until about 1840, and keep increasing slowly and steadily after that. Thus, our hypothesis seems to be borne out by the plots: as we approach modern times, occurrences of "lady" decline in fiction and occurrences of "woman" increase.

Creating a listing of all the unique words in a volume and their respective counts

In this example, we show how to produce a listing of all the unique words within a specified volume (obtained from its "basic" json data file), together with the corresponding count of the number of occurrences of each word within the volume.

#!/usr/bin/env python

"""Use Case 8: Creating a listing of all the unique words in a volume and their respective frequencies

Provide a listing of all unique words within a specified volume  (obtained from its "basic" json data file),
and the corresponding counts of the numbers of occurrences of those words within the volume.

Usage example:
    create_listing_of_words_and_frequencies_for_volume.py  <basic_json_filename>

The output generated will be written into a file called "results.txt" in your directory.
"""
from __future__ import print_function
import collections
import json
import sys
from io import open

if __name__ == '__main__':
    if len(sys.argv) < 2:
        print("Missing filename argument")
        sys.exit(1)

    filename = sys.argv[1]
    with open(filename, 'r', encoding='utf8') as input:
        data = json.load(input)

    pages = data['features']['pages']

    token_counts = collections.Counter()

    for page in pages:
        token_pos_counts = page['body']['tokenPosCount']
        for token, pos in token_pos_counts.items():
            token_counts[token] += sum(pos.values())

    with open('results.txt', 'w', encoding='utf8') as output:
        # to write tokens out in decreasing count order instead, replace
        # 'sorted(token_counts.items())' with 'token_counts.most_common()'
        for token, count in sorted(token_counts.items()):
            output.write(u"{}\t{}\n".format(token, count))
