...

  1. Consider the compressed ‘advanced’ json file for The Works of Samuel Taylor Coleridge: Prose and Verse, and the compressed ‘advanced’ json file for The Poetical Works of John Keats.

    The first of these is a file called wu.89001273754.advanced.json.bz2. It is a compressed json file in the “advanced” format of HTRC EF jsons, corresponding to a public workset that contained just a single volume, an edition of The Works of Samuel Taylor Coleridge: Prose and Verse (from the University of Wisconsin, Madison Library).

    The second of these is a file called uiug.30112065926690.advanced.json.bz2. It is a compressed json file in the “advanced” format of HTRC EF jsons, corresponding to a public workset that contained just a single volume, an edition of The Poetical Works of John Keats (from the University of Illinois, Urbana-Champaign Library).

    Note: To obtain the json files corresponding to a specific volume, you would need to create a custom workset consisting of that volume, and then download the EF dataset corresponding to that volume. (For details, see the 'Downloading Extracted Features' guide.) If, for learning purposes, you would rather access these two jsons directly, you can follow the instructions in this "activity worksheet".


  2. Uncompress these two files using bzip2 by running the following commands from the unix command line, in the same directory where you have the compressed jsons from Step 1:

    bzip2 -d wu.89001273754.advanced.json.bz2

    and

    bzip2 -d uiug.30112065926690.advanced.json.bz2

    This will create the uncompressed files wu.89001273754.advanced.json (corresponding to the aforementioned volume The Works of Samuel Taylor Coleridge: Prose and Verse) and uiug.30112065926690.advanced.json (corresponding to the aforementioned volume The Poetical Works of John Keats).

    (Why do these files have such long and difficult names? Because the filename is just the volumeID for the corresponding volume from the HathiTrust Digital Library, followed by .advanced.json.)
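
    If you prefer to stay within python, the standard-library bz2 module can perform the same decompression. The sketch below is merely an alternative to the bzip2 commands above; it assumes the two compressed files from Step 1 are in the current directory:

    Code Block
    # Hypothetical helper script (not part of the EF tooling): decompress
    # the two compressed EF jsons from Step 1 in the current directory.
    import bz2
    import shutil

    for name in ['wu.89001273754.advanced.json.bz2',
                 'uiug.30112065926690.advanced.json.bz2']:
        # Strip the trailing '.bz2' to get the uncompressed filename.
        with bz2.BZ2File(name, 'rb') as src, open(name[:-4], 'wb') as dst:
            shutil.copyfileobj(src, dst)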

  3. Obtain the occurrences of ‘S’ (capitalized) in the json file corresponding to The Works of Samuel Taylor Coleridge: Prose and Verse. You can accomplish this by running the following command at the unix command line:

    cat wu.89001273754.advanced.json | grep -o '\"S\"' | wc -l


    (Why the double quotation marks? Remember what beginLineChars looks like:

    "beginLineChars":
               {"E":2,"U":6,"F":1,"A":1,"I":2,"\"":7,"L":1,"C":2,"H":1,"W":5,"K":1,"O":1,"D":3,"S":1} )

    Note the answer returned by the unix command. Call the result CS (to stand for Coleridge, capitalized 'S').

    Now obtain the occurrences of ‘s’ (small) in the json file corresponding to The Works of Samuel Taylor Coleridge: Prose and Verse. You can accomplish this by running the following command at the unix command line:

    cat wu.89001273754.advanced.json | grep -o '\"s\"' | wc -l

    Note the answer. Call it Cs (for Coleridge, small s). Calculate the proportion, for this book, of ‘S’ (capitalized) to all ‘s’ (capitalized or not). Call this quantity C. Thus, C = CS / (Cs + CS)
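
    For example, with hypothetical counts CS = 120 and Cs = 3500 (substitute the numbers your grep commands actually returned), the arithmetic in python would be:

    Code Block
    CS = 120    # hypothetical count of "S" from the first grep
    Cs = 3500   # hypothetical count of "s" from the second grep
    C = float(CS) / (Cs + CS)
    print(C)    # approximately 0.0331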

  4. Obtain the occurrences of ‘S’ (capitalized) in the json file corresponding to The Poetical Works of John Keats. You can accomplish this by running the following command at the unix command line:

    cat uiug.30112065926690.advanced.json | grep -o '\"S\"' | wc -l

    Note the answer. Call it KS (for Keats, capitalized S).

    Now obtain the occurrences of ‘s’ (small) in the json file corresponding to The Poetical Works of John Keats. You can accomplish this by running the following command at the unix command line:

    cat uiug.30112065926690.advanced.json | grep -o '\"s\"' | wc -l

  5. Note the answer. Call it Ks (for Keats, small s). Calculate the proportion, for this book, of ‘S’ (capitalized) to all ‘s’ (capitalized or not). Call this quantity K. Thus, K = KS / (Ks + KS)
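
    As a rough cross-check, you can replicate the grep pipelines in python by counting literal substrings in the raw json text; a minimal sketch for the Keats volume:

    Code Block
    # Replicates: grep -o '\"S\"' (and '\"s\"') piped to wc -l
    with open('uiug.30112065926690.advanced.json') as f:
        text = f.read()
    KS = text.count('"S"')   # occurrences of "S" (capitalized)
    Ks = text.count('"s"')   # occurrences of "s" (small)
    print(KS)
    print(Ks)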

  6. As the Keats volume consisted almost entirely of poetry, whereas the Coleridge volume contained both prose and verse, we would expect K to be a bit greater than C. Based on the values of K and C that you calculated, check whether this intuition is corroborated!
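
    A minimal python sketch of that check, using hypothetical counts (substitute the four numbers your grep commands returned):

    Code Block
    def proportion(cap, small):
        # proportion of capitalized occurrences among all occurrences
        return float(cap) / (small + cap)

    C = proportion(120, 3500)   # hypothetical CS, Cs for the Coleridge volume
    K = proportion(450, 2800)   # hypothetical KS, Ks for the Keats volume
    print(K > C)                # True would corroborate the intuition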

  7. Notes: 

     

    1. The above example makes a simplifying assumption for illustrative purposes. As the activity described in this worksheet is meant to be carried out within a limited-time workshop setting, we deliberately made the two worksets consist of a single volume each. However, this could introduce sampling error: for example, what if the particular edition of Keats’s collected works chosen for analysis contains a lot of textual apparatus in prose? That could make it seem, based on the above kind of analysis, that the proportion of prose in Keats’s work is greater than in Coleridge’s. To guard against this kind of error, it is always good to take a large enough sample (as is true of statistical analysis in general). In this particular case, for example, you would ideally want several volumes of Coleridge’s collected works in a “Coleridge” workset, and likewise several volumes of Keats’s collected works in a “Keats” workset, and you would compute average (mean) values of C and K, to reduce the sampling error.
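
    For instance, with multi-volume worksets you would compute one proportion per volume and then average them; a purely hypothetical sketch:

    Code Block
    # Hypothetical per-volume proportions; real values would come from
    # running the above analysis on each volume in the two worksets.
    coleridge_values = [0.031, 0.035, 0.029, 0.033]
    keats_values = [0.052, 0.047, 0.050, 0.049]

    mean_C = sum(coleridge_values) / len(coleridge_values)
    mean_K = sum(keats_values) / len(keats_values)
    print(mean_C)
    print(mean_K)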

    2. If we were doing a rigorous analysis, we wouldn't just count occurrences of 'S' (or 's') anywhere in the jsons, as we just did, but would instead programmatically keep track of occurrences of 'S' (or 's') at the beginning of lines in the page-bodies of the volume, by parsing the content of each beginLineChars and keeping a running (cumulative) count of the frequency of "S", etc. However, since we are looking at ratios (within books) and also comparing them (across books), what we did provides a rough-and-ready measure, without the somewhat more complex code that such a full accounting would require. We now show (below) what that programmatic computation looks like, in python (a commonly used programming language).

    You could programmatically compute the total number of occurrences of "S" in beginLineChars across the page-bodies of all pages for a volume with the HathiTrust volumeID 

    uiug.30112065926690 

    in the following way:

    python count_letter_begin.py uiug.30112065926690.advanced.json 'S'

    where count_letter_begin.py is the python program shown in the code block below.

    The code is general-purpose: you can use count_letter_begin.py to count the occurrences of any letter within the beginLineChars fields of any "advanced" json that you have downloaded from the extracted features dataset:

    python count_letter_begin.py <any advanced json downloaded from EF data> '<letter whose occurrences within that volume's beginLineChars fields are to be counted>'

     

    Code Block
    # count_letter_begin.py
    # Count the number of occurrences of a given letter within
    # the beginLineChars fields of the page bodies in any "advanced"
    # json downloaded from the extracted features dataset.
    #
    # usage: python count_letter_begin.py <advanced json file> '<letter>'

    import sys
    import json
    from pprint import pprint

    # Load the whole EF json into a python dictionary.
    with open(sys.argv[1]) as data_file:
        data = json.load(data_file)

    letter = sys.argv[2]

    # Walk the pages, accumulating this letter's count from each
    # page body's beginLineChars dictionary.
    count = 0
    for page in data['features']['pages']:
        begin_line_chars = page['body']['beginLineChars']
        if letter in begin_line_chars:
            count = count + begin_line_chars[letter]

    pprint(count)
    
    

    Explanation of how the above python code works:

    Running

    data = json.load(data_file)
    pprint(data)

    will generate output that looks like this:

    Code Block
    {u'features': {u'dateCreated': u'2015-02-19T04:15',
                   u'pageCount': 272,
                   u'pages': [{u'body': {u'beginLineChars': {},
                                         u'capAlphaSeq': 0,
                                         u'endLineChars': {}},
                               u'footer': {u'beginLineChars': {},
                                           u'endLineChars': {}},
                               u'header': {u'beginLineChars': {},
                                           u'endLineChars': {}},
                               u'seq': u'00000001'},
                              {u'body': {u'beginLineChars': {u'_': 1, u'r': 1},
                                         u'capAlphaSeq': 0,
                                         u'endLineChars': {u'L': 1, u'_': 1}},
                               u'footer': {u'beginLineChars': {},
                                           u'endLineChars': {}},
                               u'header': {u'beginLineChars': {},
                                           u'endLineChars': {}},
                               u'seq': u'00000002'},
                              {u'body': {u'beginLineChars': {u',': 1,
                                                             u'5': 1,
                                                             u'E': 1,
                                                             u'I': 1,
                                                             u'\\': 2,
                                                             u'n': 1,
                                                             u'w': 1},
                                         u'capAlphaSeq': 2,
                                         u'endLineChars': {u'.': 4,
                                                           u';': 1,
                                                           u'L': 1,
                                                           u'P': 1,
                                                           u'\u2018': 1}},
                               etc.

     

    You then have to add up the counts of "S" from each page-body's beginLineChars in the above output. To get at those counts in python, you can access the body of the nth page (and hence its beginLineChars) in the following way:

    pprint(data['features']['pages'][n-1]['body'])

...