...

To start, we'll import Pandas, a very common data science library in Python, and the HTRC FeatureReader, a library written by HTRC to ease the use of the Extracted Features Dataset:

Next, we'll use both volumes' HathiTrust IDs (HTIDs), their permanent, unique identifiers in HathiTrust, to download their EF files and create a FeatureReader Volume object for each of our volumes of interest. Volume objects let us take advantage of the volume-level data in the files, and a Volume in FeatureReader has specific properties and built-in methods that let us do a number of common tasks. We can create a Volume simply by using Volume() and adding the HTID in the parentheses:

...

After running the above code, the print statements will indicate that we have successfully created a Volume object for each work by Le Guin.

...

We'll be focusing on one particular field of features in the EF Dataset: begin-line characters. This is because we're working on the assumption that a page of poetry is likely to contain more capitalized letters at the beginning of its lines, a common poetry convention that Le Guin is known to have followed. To start, we'll create a Pandas DataFrame with the first character of each line (begin_line_char) by page, for both of our volumes:

...
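As a rough illustration of the assumption behind this step, the sketch below computes, for each page, the share of lines that begin with a capital letter. It uses plain Python and made-up counts rather than FeatureReader or Pandas, so it only shows the reasoning, not the tutorial's actual code. The names `toy_pages` and `capital_share` are hypothetical and are not part of the library.

```python
# Hypothetical begin-line character counts, keyed by page number.
# Each inner dict maps a line's first character to the number of lines
# on that page that start with it. Real counts would come from the EF files.
toy_pages = {
    1: {"T": 4, "A": 3, "W": 2, "a": 1},            # most lines start capitalized
    2: {"t": 5, "a": 4, "o": 3, "T": 1, '"': 1},    # most lines start lowercase
}


def capital_share(char_counts):
    """Return the fraction of lines on a page that begin with an uppercase letter."""
    total = sum(char_counts.values())
    if total == 0:
        # A page with no lines has no capitalized line starts.
        return 0.0
    capitals = sum(n for ch, n in char_counts.items() if ch.isupper())
    return capitals / total


# One share per page; a higher share suggests (but does not prove) verse.
shares = {page: capital_share(counts) for page, counts in toy_pages.items()}

for page, share in sorted(shares.items()):
    print(f"page {page}: {share:.2f} of lines begin with a capital letter")
```

Under this heuristic, a page with a high share is a candidate for poetry. Prose pages can also score high (for example, pages of short dialogue lines), so any threshold should be treated as a starting point to check against the text.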