
Descriptions of the text analysis algorithms available in the HTRC Portal are listed below. They are also available on the Algorithms page in the Portal.

Thanks to Sayan Bhattacharyya, Loretta Auvil, Harriett Green, Thomas Padilla, Erica Parker, and others for their input on the descriptions. 

 

Each numbered entry below lists the algorithm's name, a simple description (with the result of the job), a technical description, the author, and the version.
1. EF Rsync Script Generator

Generate a script that allows you to download extracted features data for your workset of choice. The script lists the rsync commands needed to access the volumes of the workset and can be run locally.

Result of job: script to download extracted features data files

Generates a script to download the extracted features (EF) data for the specified workset using rsync. For more information on the extracted features data, see https://analytics.hathitrust.org/features.

Note: Extracted features data was not created for a small number of volumes, so it is possible that not all of your workset volumes will be processed.
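For illustration only, a generator of this kind might emit one rsync command per volume ID, as in the minimal Python sketch below; the remote module name, file layout, and ID-cleaning convention are placeholder assumptions rather than the Portal's actual settings.

```python
# Hypothetical sketch of an EF rsync script generator; the remote module name
# and file layout are placeholders, not the actual extracted-features layout.
def clean_id(volume_id: str) -> str:
    # HathiTrust IDs contain characters (":" and "/") that are awkward in filenames.
    return volume_id.replace(":", "+").replace("/", "=")

def generate_rsync_script(volume_ids, remote="data.example.org::features", out="download_ef.sh"):
    with open(out, "w") as fh:
        fh.write("#!/bin/sh\n")
        for vid in volume_ids:
            fh.write(f"rsync -av {remote}/{clean_id(vid)}.json.bz2 .\n")
    return out

generate_rsync_script(["mdp.39015012345678", "uc1.b000012345"])
```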

Author: Colleen Fallaw | Version: 2.0

2. MARC Downloader

Download the bibliographic information for each volume in your workset of choice.

Result of job: zip (compressed) file that, when downloaded and expanded, contains the bibliographic metadata for each volume

Takes a workset as input and outputs the MARC (Machine Readable Cataloging) record for each volume in the workset in MarcXML format.

Author: Zong Peng | Version: 1.5
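As a rough picture of what such a download involves, the sketch below fetches one bibliographic record per volume and bundles the responses into a zip file; the endpoint shown (HathiTrust's public Bibliographic API) and the JSON output are assumptions here, whereas the Portal algorithm returns MarcXML.

```python
# Illustrative sketch only: fetch a bibliographic record per volume and bundle
# the responses into a zip file. The endpoint is an assumption; the HTRC
# algorithm itself outputs MarcXML for each volume.
import urllib.request
import zipfile

def download_bib_records(volume_ids, out_zip="bib_records.zip"):
    with zipfile.ZipFile(out_zip, "w") as zf:
        for vid in volume_ids:
            url = f"https://catalog.hathitrust.org/api/volumes/full/htid/{vid}.json"
            with urllib.request.urlopen(url) as resp:
                safe_name = vid.replace(":", "+").replace("/", "=")
                zf.writestr(safe_name + ".json", resp.read())
    return out_zip
```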
3. Meandre Classification Naive Bayes

Classify the volumes in a workset into categories of your choosing. Naïve Bayes classification is based on Bayes' Theorem from statistics, and uses machine learning to estimate the correct classification for a volume based on information present in the volumes of each particular class. You will need to upload a custom csv file with volumes already classified in order to use this algorithm. Note that the algorithm cannot currently classify an unlabeled workset, so it does not yet work as fully as intended.

Result of job: text files of the results of the classification and the confusion matrix for the model

Performs Naïve Bayes classification on a workset (uploaded as a csv file) that contains the workset’s volume identifiers as a "volume_id" attribute and labeled data as a "class" attribute.

The algorithm:

  • loads each page of each volume in the workset;
  • removes the first and last line of each page;
  • joins hyphenated words (if any) occurring at the end of each remaining line;
  • removes all tokens that do not consist of alphanumeric characters;
  • performs part-of-speech tagging (selecting nouns, verbs, adjectives and adverbs);
  • lowercases all tokens and counts the tokens remaining for each volume, eliminating all tokens with a count of 1;
  • notes the number of attributes to use (specified by the num_attributes parameter);
  • removes any attributes that exist in only one volume and any attributes that exist in all volumes;
  • splits the data randomly into a training set (60% of the data) and a test set (40% of the data);
  • performs uniform discretization of the token frequency counts by creating 2 equally spaced bins between the minimum and maximum of each scalar column;
  • creates a Naïve Bayes model on the training set, calculates the accuracy on the training and test sets, and outputs the confusion matrix; the predicted values for the volumes in the training and test sets are also output.


Note: The upper limit on the number of volumes is 1000.
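A minimal sketch of the main steps above (attribute filtering, the 60/40 split, 2-bin uniform discretization, and a Naïve Bayes model with a confusion matrix) is shown below using scikit-learn rather than the Meandre components the Portal actually runs; the token counts and labels are assumed to be already extracted.

```python
# Minimal sketch, not the Portal's implementation: Naive Bayes classification
# of volumes from pre-computed token counts, following the steps listed above.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, accuracy_score

def classify(counts, labels):
    """counts: (n_volumes, n_tokens) array of token frequencies; labels: class per volume."""
    counts = np.asarray(counts)

    # Drop tokens that occur in only one volume or in every volume.
    present_in = (counts > 0).sum(axis=0)
    keep = (present_in > 1) & (present_in < counts.shape[0])
    counts = counts[:, keep]

    # 60% training / 40% test split, as in the description.
    X_tr, X_te, y_tr, y_te = train_test_split(counts, labels, train_size=0.6, random_state=0)

    # Uniform discretization into 2 equally spaced bins per column.
    disc = KBinsDiscretizer(n_bins=2, encode="ordinal", strategy="uniform")
    X_tr = disc.fit_transform(X_tr)
    X_te = disc.transform(X_te)

    model = MultinomialNB().fit(X_tr, y_tr)
    pred = model.predict(X_te)
    return accuracy_score(y_te, pred), confusion_matrix(y_te, pred)
```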

Author: Loretta Auvil | Version: 1.0

4. Meandre Dunning Log-likelihood to Tagcloud

Compare and contrast two worksets by identifying the words that are more and less common in one workset, called the analysis workset, than in another workset, called the reference workset.

Result of job: tag cloud visualizations and lists of most and least commonly shared words in csv format

 

This algorithm:

  • calculates the Dunning log-likelihood statistic for two worksets provided as inputs: an “analysis workset” and a “reference workset” (this core functionality was developed as part of the Monk Project);
  • loads each page of each volume in each workset, removes the first and last line of each page, and joins hyphenated words that occur at the end of a line;
  • performs part-of-speech tagging (selecting only NN|NNS|JJ.*|RB.*|PRP.*|RP|VB.*|IN);
  • lowercases the remaining tokens;
  • counts the remaining tokens across all volumes in each collection;
  • compares the counts from the two collections using the Dunning log-likelihood statistic;
  • displays the "overused" tokens in the analysis collection relative to the reference collection (200 tokens by default) as a tag cloud and makes them available as a csv file;
  • likewise displays the "underused" tokens in the analysis collection relative to the reference collection (also 200 tokens by default) as a tag cloud and makes them available as a csv file.

Note: The upper limit on the number of volumes is 1000.
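The core comparison can be sketched as follows, assuming token counts for the analysis and reference worksets have already been produced; this illustrates the Dunning log-likelihood statistic only, not the Meandre flow itself.

```python
# Sketch of the Dunning log-likelihood (G-squared) comparison between an
# analysis corpus and a reference corpus, given pre-computed token counts.
import math
from collections import Counter

def dunning_g2(count_a, total_a, count_b, total_b):
    """Log-likelihood that a token is over/under-used in corpus A relative to B."""
    e_a = total_a * (count_a + count_b) / (total_a + total_b)  # expected count in A
    e_b = total_b * (count_a + count_b) / (total_a + total_b)  # expected count in B
    g2 = 0.0
    if count_a:
        g2 += count_a * math.log(count_a / e_a)
    if count_b:
        g2 += count_b * math.log(count_b / e_b)
    return 2 * g2

def compare(analysis: Counter, reference: Counter, top=200):
    total_a, total_b = sum(analysis.values()), sum(reference.values())
    scores = {}
    for tok in set(analysis) | set(reference):
        a, b = analysis[tok], reference[tok]
        g2 = dunning_g2(a, total_a, b, total_b)
        # Positive score = overused in the analysis workset, negative = underused.
        scores[tok] = g2 if a / total_a >= b / total_b else -g2
    ranked = sorted(scores, key=scores.get, reverse=True)
    overused, underused = ranked[:top], list(reversed(ranked[-top:]))
    return overused, underused
```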

Author: Loretta Auvil | Version: 1.1

5. Meandre OpenNLP Date Entities To Simile

Visualize the dates in a workset on a timeline. Each date (e.g., May 4, 1803) is displayed with its unique HathiTrust Digital Library volume identifier, the page on which it occurred, and a snippet of the sentence in which it occurred.

Result of job: timeline visualization

Information extraction is used to extract date entities that can be displayed on a timeline. This allows a researcher to review sentences that include dates via the timeline. We are using the OpenNLP system to automatically extract the entities from the text. The date entities, and the sentences in which they exist, are then displayed in Simile Timeline.

The algorithm:

  • loads each page of each volume;
  • removes the first and last line of each page;
  • joins hyphenated words that occur at the end of the line;
  • for cleaning purposes, puts spaces around each of the following characters ", . ( ) [ ]";
  • extracts date entity types from the text;
  • displays each entity with the volume_id, page_id, and sentence snippet in Simile Timeline;

Note: The upper limit on the number of volumes is 100.
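The page-cleaning steps listed above (dropping the first and last line of a page, rejoining end-of-line hyphenation, padding punctuation with spaces) might look roughly like the sketch below; the OpenNLP date extraction itself is not reproduced here.

```python
# Rough sketch of the page-cleaning steps shared by several of these algorithms;
# the entity extraction is done with OpenNLP in the actual algorithm.
import re

def clean_page(page_text: str) -> str:
    lines = page_text.splitlines()[1:-1]           # drop the first and last line of the page
    text = "\n".join(lines)
    text = re.sub(r"-\n(\w+)", r"\1\n", text)      # join words hyphenated across line breaks
    text = re.sub(r"([,.()\[\]])", r" \1 ", text)  # put spaces around , . ( ) [ ]
    return text
```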

Author: Loretta Auvil | Version: 1.1

6. Meandre OpenNLP Entities List

Generate a list of all of the names of people and places, as well as dates, times, percentages, and monetary terms, found in a workset. You can choose which entities you would like to extract.

Result of job: table of the named entities found in a workset

This algorithm:

  • extracts named entities and provides this information in a table (the OpenNLP system is used to extract the entities from the text automatically);
  • loads each page of each volume from HTRC;
  • removes the first and last line of each page;
  • joins hyphenated words that occur at the end of the line;
  • extracts entity types specified from the text;
  • displays each entity with the volume_id, page_id, sentence_id and character position within the sentence.

Note: The volume limit is 100.
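As a stand-in illustration of tabulating entities with volume, page, sentence, and character-position information, the sketch below uses spaCy in place of OpenNLP; the model name and output layout are assumptions.

```python
# Stand-in sketch only: the Portal uses OpenNLP, not spaCy. Tabulates entities
# with volume_id, page_id, sentence_id, character position, type, and text.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this English model is installed

def entity_table(volume_id, pages, wanted=frozenset({"PERSON", "GPE", "DATE", "TIME", "PERCENT", "MONEY"})):
    rows = []
    for page_id, text in enumerate(pages, start=1):
        doc = nlp(text)
        for sent_id, sent in enumerate(doc.sents, start=1):
            for ent in sent.ents:
                if ent.label_ in wanted:
                    rows.append((volume_id, page_id, sent_id,
                                 ent.start_char - sent.start_char, ent.label_, ent.text))
    return rows
```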

Author: Loretta Auvil | Version: 1.1

7. Meandre Spellcheck Report Per Volume

Find misspelled words that are the result of OCR errors in the text of a workset's volumes, with suggested replacements. Currently the replacements cannot be made within the Portal.

Result of job: lists of the misspellings in a workset, the number of times they occur, and suggested corrected spellings

The algorithm:

  • loads each page of each volume in the workset supplied to the algorithm as a parameter;
  • performs lowercase transformation of text;
  • provides several spelling statistics at the volume level for the html report;
  • for the text file reports, displays the information for each volume with a blank line separating the volumes;
  • offers options to customize the dictionary, the token counts, and the transformation rules.

The token counts data is used to determine whether a suggested dictionary word actually occurs in the workset, and therefore whether it should be used. There are options for customizing the transformation rules, which indicate the types of OCR errors that should be corrected. For instance, a known problem is the transformation of an "li" to an "h" and vice versa; this is expressed with the transformation rule "li=h", which says that, for all misspelled words containing an "h", a check will be done to see whether converting it to "li" forms a correctly spelled word.
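A transformation rule such as "li=h" can be pictured as in the sketch below: for a word the dictionary does not recognize, substitute in both directions and keep any variant that is a dictionary word. The dictionary and rule set shown are tiny stand-ins, not the algorithm's actual data.

```python
# Illustrative sketch of applying transformation rules to a misspelled word;
# the dictionary and rules here are placeholders.
def suggest(word, rules, dictionary):
    suggestions = []
    for a, b in rules.items():
        for src, dst in ((a, b), (b, a)):          # try the rule in both directions
            candidate = word.replace(src, dst)
            if candidate != word and candidate in dictionary:
                suggestions.append(candidate)
    return suggestions

# e.g. suggest("hke", {"li": "h"}, {"like"}) -> ["like"]
```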

Note: The upper limit on the number of volumes is 100.

Author: Loretta Auvil | Version: 1.1

8. Meandre Tagcloud

Create a tag cloud visualization of the most frequently occurring words in a workset, as well as a list of the most frequent words. In a tag cloud, each word is displayed at a size proportional to the number of times it occurs.

Result of job: list of most frequent words and a tag cloud visualization of them

This algorithm:

  • performs token counts across all volumes and displays the top 200 tokens in a tag cloud.

Notes:

  • No cleaning of the text is performed.
  • The upper limit on the number of volumes is 1000.
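The counting step can be pictured as a simple tally across all volumes that keeps the 200 most frequent tokens for the tag cloud; the sketch below assumes the volume texts are already loaded and, as noted above, performs no cleaning.

```python
# Minimal sketch of the counting step: tally tokens across all volumes and
# keep the 200 most frequent for the tag cloud. No text cleaning is performed.
from collections import Counter

def top_tokens(volume_texts, top=200):
    counts = Counter()
    for text in volume_texts:
        counts.update(text.split())
    return counts.most_common(top)
```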
Author: Loretta Auvil | Version: 1.1

9. Meandre Tagcloud with Cleaning

Cleans the text before creating a tag cloud visualization of the most frequently occurring words in a workset. In a tag cloud, each word is displayed at a size proportional to the number of times it occurs.

Result of job: list of most frequent words in a workset's cleaned text and a tag cloud visualization of them

Performs token counts with some additional text cleaning and displays the most frequent tokens in a tag cloud.

This algorithm:

  • loads each page of each volume from HTRC;
  • removes the first and last line of each page;
  • joins hyphenated words that occur at the end of the line;
  • performs lowercase transformation of text;
  • removes all tokens that don't consist of alphanumeric characters;
  • uses the replacement rules (learned from our usage of Google Ngrams data) to clean OCR errors, normalize to British spelling and normalize for period spelling;
  • filters stop words;
  • counts the tokens remaining for all volumes and displays the top 200 tokens in a tag cloud.

Note: The upper limit on the number of volumes is 1000.
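The token-level cleaning can be pictured roughly as below (lowercasing, alphanumeric filtering, stop-word removal, then the top-200 count); the stop-word list is a stand-in and the Google Ngrams replacement rules are not reproduced.

```python
# Rough sketch of the token-level cleaning and counting steps; the stop-word
# list is a placeholder and the OCR-correction replacement rules are omitted.
import re
from collections import Counter

STOPWORDS = {"the", "and", "of", "to", "a", "in", "that", "is"}  # stand-in list

def cleaned_tagcloud_counts(volume_texts, top=200):
    counts = Counter()
    for text in volume_texts:
        tokens = re.findall(r"[a-z0-9]+", text.lower())   # lowercase, alphanumeric only
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(top)
```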

Author: Loretta Auvil | Version: 1.1

10. Meandre Topic Modeling

Identify "topics" in a workset based on words that have a high probability of occurring close together in the text. Topics are models trained on co-occurring text using Latent Dirichlet Allocation (LDA), where each topic is treated as a generative model and volumes are assigned a probability of how likely each topic is to have generated that text. The most likely words for a topic are displayed as a word cloud.

Result of job: xml file with topics, and visualizations of them in the form of tag clouds.


Performs topic modeling analysis in the style of Latent Dirichlet Allocation (LDA) and its variants, notably the form used in Mallet.

This algorithm:

  • loads each page of each volume from HTRC;
  • removes the first and last line of each page;
  • joins hyphenated words that occur at the end of the line;
  • removes all tokens that do not consist of alphanumeric characters;
  • filters stop words;
  • replaces "not " with "not_" to deal with negations;
  • creates a topic model using Mallet;
  • displays the top 200 tokens in a tag cloud. 


Note: The upper limit on the number of volumes is 1000.
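For illustration, the modeling step can be sketched with scikit-learn's LDA implementation as below; the Portal algorithm uses Mallet, so options and results differ, and the cleaned volume texts are assumed to be already prepared.

```python
# Illustrative sketch of the topic-modeling step using scikit-learn's LDA;
# the actual algorithm builds its topic model with Mallet.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def topic_model(volume_texts, n_topics=10, top_words=20):
    vec = CountVectorizer(stop_words="english")
    counts = vec.fit_transform(volume_texts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0).fit(counts)
    vocab = vec.get_feature_names_out()
    topics = []
    for weights in lda.components_:
        top = weights.argsort()[::-1][:top_words]      # highest-weighted words per topic
        topics.append([vocab[i] for i in top])
    return topics
```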

Author: Loretta Auvil | Version: 1.1

11. Simple Deployable Word Count

Identify the words that occur most often in a workset and the number of times they occur.

Result of job: list of the most frequent words in a workset and the number of times they occur

A simple word count Java client that uses the HTRC Data API to access the volumes in the specified workset, and displays the top N most frequently occurring words within that workset.

Author: Yiming Sun | Version: 1.4

 

 
