Descriptions of the text analysis algorithms available in the HTRC Portal are listed below. They are also available on the Algorithms page in the Portal.
|#||Name||Simple description||Technical description||Author||Version|
EF Rsync Script Generator
Generate a script that downloads the extracted features data for a workset of your choice. Run the script locally; it lists the rsync commands needed to fetch the volumes of the workset.
Result of job: script to download extracted features data files
Generates a script to download the extracted features (EF) data for the specified workset using rsync. For more information on the extracted features data see https://analytics.hathitrust.org/features.
Note: Extracted features data was not created for a small number of volumes, so it is possible that not all of your workset volumes will be processed.
Download the bibliographic information for each volume in your workset of choice.
Result of job: zip (compressed) file that, when downloaded and expanded, contains the bibliographic metadata for each volume
|Takes a workset as input and outputs the MARC (Machine Readable Cataloging) record for each volume in the workset in MarcXML format.||Zong Peng||1.7|
|3||Meandre Dunning Log-likelihood to Tagcloud|
Compare and contrast two worksets by identifying the words that are more and less common in one workset, called the analysis workset, than in another workset, called the reference workset.
Result of job: tag cloud visualizations and lists of most and least commonly shared words in CSV format
Note: The upper limit on the number of volumes is 1000.
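The comparison described above can be illustrated with a minimal sketch. It uses the common two-term form of Dunning's log-likelihood statistic; the exact formula and preprocessing used by the Meandre component are not documented here, so treat this as an assumption. The function names `dunning_g2` and `compare_worksets` are hypothetical, and the sign convention (positive means overrepresented in the analysis workset) is added for illustration.

```python
import math
from collections import Counter


def dunning_g2(count_a, total_a, count_b, total_b):
    """Signed log-likelihood score for one word.

    count_a / total_a: word count and corpus size in the analysis workset.
    count_b / total_b: the same for the reference workset.
    Both corpora are assumed to be non-empty.
    """
    combined = count_a + count_b
    # Expected counts if the word were equally common in both corpora.
    expected_a = total_a * combined / (total_a + total_b)
    expected_b = total_b * combined / (total_a + total_b)
    g2 = 0.0
    if count_a > 0:
        g2 += count_a * math.log(count_a / expected_a)
    if count_b > 0:
        g2 += count_b * math.log(count_b / expected_b)
    g2 *= 2
    # Illustrative sign convention: negative when the word is relatively
    # more common in the reference workset.
    if count_a / total_a < count_b / total_b:
        g2 = -g2
    return g2


def compare_worksets(analysis_tokens, reference_tokens):
    """Rank every word from most overrepresented to most underrepresented."""
    a = Counter(analysis_tokens)
    b = Counter(reference_tokens)
    total_a = sum(a.values())
    total_b = sum(b.values())
    scores = {
        word: dunning_g2(a[word], total_a, b[word], total_b)
        for word in set(a) | set(b)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Words at the top of the ranked list are the ones a tag cloud of "more common" words would emphasize; words at the bottom correspond to the "less common" list.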
|4||Meandre OpenNLP Date Entities To Simile|
Visualize the dates in a workset on a timeline. Each date (ex. May 4, 1803) is displayed with its unique HathiTrust Digital Library volume identifier, the page on which it occurred, and a snippet of the sentence in which it occurred.
Result of job: timeline visualization
Information extraction is used to extract date entities that can be displayed on a timeline. This allows a researcher to review sentences that include dates via the timeline. We are using the OpenNLP system to automatically extract the entities from the text. The date entities, and the sentences in which they exist, are then displayed in Simile Timeline.
Note: The upper limit on the number of volumes is 100.
|5||Meandre OpenNLP Entities List|
Generate a list of all of the names of people and places, as well as dates, times, percentages, and monetary terms, found in a workset. You can choose which entities you would like to extract.
Result of job: table of the named entities found in a workset
Note: The upper limit on the number of volumes is 100.
|6||Meandre Spellcheck Report Per Volume|
Find misspelled words that are the result of OCR errors in the text of a workset's volumes, with suggested replacements. Currently the replacements cannot be made within the Portal.
Result of job: lists of the misspellings in a workset, the number of times they occur, and suggested corrected spellings
The token count data is used to check whether a suggested dictionary word actually occurs in that data, and therefore whether the suggestion should be used. The transformation rules, which indicate the types of OCR errors that should be corrected, can be customized. For instance, a known problem is the transformation of an "li" to an "h" and vice versa. This is expressed with the transformation rule "li=h", which says that, for every misspelled word containing an "h", a check is done to see whether converting the "h" to "li" forms a correctly spelled word.
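A minimal sketch of how a rule such as "li=h" might be applied is shown below. It is not the Meandre implementation: the function names `candidates` and `suggest` are hypothetical, the rule is assumed to apply in both directions, and a suggestion is kept only if it is in the dictionary and also appears in the token counts.

```python
def candidates(word, rule):
    """Return the words formed by swapping one occurrence of either side
    of a rule like "li=h" for the other side."""
    left, right = rule.split("=", 1)
    results = set()
    # Assumption: the rule is tried in both directions.
    for src, dst in ((left, right), (right, left)):
        start = word.find(src)
        while start != -1:
            results.add(word[:start] + dst + word[start + len(src):])
            start = word.find(src, start + 1)
    results.discard(word)
    return results


def suggest(word, rules, dictionary, token_counts):
    """Suggest replacements for a misspelled word.

    dictionary: set of correctly spelled words.
    token_counts: mapping of word -> number of occurrences.
    """
    if word in dictionary:
        return []  # Not a misspelling, nothing to suggest.
    found = set()
    for rule in rules:
        for candidate in candidates(word, rule):
            # Keep only dictionary words that also occur in the counts.
            if candidate in dictionary and token_counts.get(candidate, 0) > 0:
                found.add(candidate)
    # Most frequent suggestion first, ties broken alphabetically.
    return sorted(found, key=lambda w: (-token_counts[w], w))
```

For example, `suggest("hght", ["li=h"], {"light"}, {"light": 5})` converts the first "h" to "li" and finds "light" in both the dictionary and the token counts.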
Create a tag cloud visualization of the most frequently occurring words in a workset, as well as a list of the most frequent words. In a tag cloud, the size of the word is displayed in proportion to the number of times it occurred.
Result of job: list of most frequent words and a tag cloud visualization of them
|8||Meandre Tagcloud with Cleaning|
Clean the text of a workset, then create a tag cloud visualization of its most frequently occurring words. In a tag cloud, the size of the word is displayed in proportion to the number of times it occurred.
Result of job: list of most frequent words in a workset's cleaned text and a tag cloud visualization of them
Performs token counts with some additional text cleaning and displays the most frequent tokens in a tag cloud.
Note: The upper limit on the number of volumes is 1000.
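The cleaning and sizing steps can be sketched as follows. The cleaning rule (lowercase, keep only runs of the letters a to z) and the font size range are assumptions made for illustration; the actual cleaning performed by the Meandre flow is not specified here. `clean_tokens` and `font_sizes` are hypothetical names.

```python
import re
from collections import Counter


def clean_tokens(text):
    """Very simple cleaning: lowercase and keep only runs of a-z letters."""
    return re.findall(r"[a-z]+", text.lower())


def font_sizes(counts, min_size=10, max_size=48):
    """Scale each word's font size in proportion to its count.

    The most frequent word gets max_size; smaller words are floored at
    min_size so they stay legible (an illustrative choice).
    """
    if not counts:
        return {}
    top = max(counts.values())
    return {
        word: max(min_size, round(max_size * count / top))
        for word, count in counts.items()
    }


# Example: count cleaned tokens, then size the three most frequent words.
tokens = clean_tokens("The Whale, the WHALE! The sea.")
sizes = font_sizes(dict(Counter(tokens).most_common(3)))
```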
|9||Meandre Topic Modeling|
Identify "topics" in a workset based on words that have a high probability of occurring close together in the text. Topics are models trained on co-occurring text using Latent Dirichlet Allocation (LDA), where each topic is treated as a generative model and volumes are assigned a probability of how likely each topic is to have generated that text. The most likely words for a topic are displayed as a word cloud.
Result of job: XML file with topics, and visualizations of them in the form of tag clouds
Performs topic modeling analysis in the style of Latent Dirichlet Allocation (LDA) and its variants, notably the form used in MALLET.
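To make the idea of LDA more concrete, here is a small collapsed Gibbs sampler, one common way of fitting an LDA model. It is a teaching sketch only, not the code that MALLET or the Meandre flow runs; the function name `lda_gibbs`, the hyperparameter defaults, and the choice of five top words per topic are all assumptions.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA to docs (a list of token lists) by collapsed Gibbs sampling.

    Returns (topics, doc_mix): the top words of each topic, and each
    document's estimated topic proportions.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]       # tokens per doc/topic
    topic_word = [[0] * V for _ in range(num_topics)]  # tokens per topic/word
    topic_total = [0] * num_topics
    assignments = []
    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                z = assignments[d][i]
                # Remove the token, then resample its topic given the rest.
                doc_topic[d][z] -= 1
                topic_word[z][wid] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][wid] + beta)
                    / (topic_total[k] + V * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][wid] += 1
                topic_total[z] += 1
    topics = []
    for k in range(num_topics):
        ranked = sorted(range(V), key=lambda j: -topic_word[k][j])
        topics.append([vocab[j] for j in ranked[:5]])
    doc_mix = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in row]
        for row, doc in zip(doc_topic, docs)
    ]
    return topics, doc_mix
```

Each entry of `topics` is the kind of word list that would be drawn as one tag cloud, and each row of `doc_mix` gives the per-volume topic probabilities described above.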
|10||Simple Deployable Word Count|
Identify the words that occur most often in a workset and the number of times they occur.
Result of job: list of the most frequent words in a workset and the number of times they occur
|A simple word count Java client that uses the HTRC Data API to access the volumes in the specified workset, and displays the top N most frequently occurring words within that workset.||Yiming Sun||1.4|
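The counting step itself is simple enough to sketch in a few lines. The real algorithm is a Java client that reads volumes through the HTRC Data API; the Python sketch below only illustrates the top-N logic on text already in memory, and the function name `top_words` and whitespace tokenization are assumptions.

```python
from collections import Counter


def top_words(volume_texts, n=10):
    """Return the n most frequent words across all volume texts,
    with their counts, most frequent first."""
    counts = Counter()
    for text in volume_texts:
        # Assumption: lowercase and split on whitespace.
        counts.update(text.lower().split())
    return counts.most_common(n)
```

For example, `top_words(["a b a", "b a c"], 2)` returns the two most frequent words together with their counts.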
- Naive-Bayes classification