HTRC Algorithms are click-to-run tools for text analysis. They require no programming, and researchers can set the parameters for their analysis. You can run HTRC algorithms against collections of HathiTrust data, called worksets, in order to analyze them or download their Extracted Features. Worksets can be cited, and researchers can choose to make their worksets public or private.


The basics

The data

HTRC Algorithms are designed to be run on sub-collections of volumes from HathiTrust, called worksets.  You cannot run HTRC Algorithms on non-HathiTrust data. 

HTRC Algorithms can analyze volumes in a workset as long as they have been synced from HathiTrust to HTRC. While syncing happens regularly, there may be occasional discrepancies.

HTRC Algorithms can analyze in-copyright ("limited view") as well as public domain ("full view") volumes from HathiTrust. 

HathiTrust data is not exposed or viewable within HTRC Algorithms or worksets. A researcher applies an algorithm to their workset (collection), the data is retrieved and processed behind the scenes, and only the results are viewable.


Algorithm results

Every time you run an algorithm against a workset, that run is called a job. You can view the status of the jobs you have submitted. You can also delete jobs, for example if you have made an error in your set-up and want to start again.

The results of your jobs are stored in HTRC Analytics and you can also download certain results files for each algorithm. 


The algorithms

Each algorithm entry below gives the algorithm's name and description, its specifications and parameters, how to utilize the results, and the author and version.

EF Rsync Script Generator

Generates a shell script that allows you to download Extracted Features data for the volumes in a workset. The script lists the rsync commands to access the Extracted Features files, one for each volume in the workset. The script should be downloaded and run locally.

Result of job: a shell script to download Extracted Features data files

Volume limit: none

  • For more information on the Extracted Features data, see https://analytics.hathitrust.org/features.
  • If there is a volume in your workset for which there is not a corresponding Extracted Features file, that volume will be skipped in the shell script.
    • Extracted Features files are periodically generated and may be out-of-sync with the full HathiTrust corpus.
  • Download the shell script (EF_Rsync.sh)

  • If you like, rename and/or move the file from your download folder

  • From the Bash shell (Terminal on a Mac, or Cygwin/GitBash on Windows), navigate to the directory where your results script is saved. Then run the shell script, modifying the file name if you changed it earlier:
    sh EF_Rsync.sh
  • The HTRC Extracted Features JSON file for each volume in your workset will be transferred to your machine (see the sketch after this list for one way to read these files locally).
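
As a starting point for working with the transferred files locally, here is a minimal Python sketch for reading one downloaded Extracted Features file. The file name is hypothetical, and the layout assumed here (token counts under features/pages/body/tokenPosCount) follows the published Extracted Features schema; consult https://analytics.hathitrust.org/features if your files differ.

    import bz2
    import json
    from collections import Counter

    # Hypothetical file name: one volume's Extracted Features file as delivered
    # by the rsync script; adjust the path to match your download.
    path = "example_volume.json.bz2"

    # Extracted Features files are distributed as bzip2-compressed JSON;
    # if yours are plain .json, use open() instead of bz2.open().
    with bz2.open(path, "rt", encoding="utf-8") as f:
        volume = json.load(f)

    # Tally token counts across all page bodies. The keys used here are
    # assumptions based on the published schema; check your own files.
    counts = Counter()
    for page in volume.get("features", {}).get("pages", []):
        body = page.get("body") or {}
        for token, pos_counts in (body.get("tokenPosCount") or {}).items():
            counts[token] += sum(pos_counts.values())

    print(counts.most_common(10))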
Author: Colleen Fallaw. Version: 3.0.2

InPhO Topic Model Explorer

Trains multiple LDA topic models and allows you to export files containing the word-topic and topic-document distributions, along with an interactive visualization. For a full, detailed description, please review the documentation.

Result of job: Four files are generated. Three are used for displaying a visualization of topic clusters and top terms: topics.html, cluster.csv, and topics.json. The final file (workset.tez) can be used with a local install of the Topic Explorer to access the complete word-topic and topic-document matrices, along with other advanced analytics.

Volume limit: 3000 volumes or 3GB

  • Tokenization is performed on each volume using the topicexplorer init command, which:
    • Normalizes the text
    • Performs well with whitespace-delimited languages, including English, Polish, Russian, Turkish, Greek, Italian, Latin, French, German, and Spanish. The tokenizer does not perform well with East Asian languages or any other languages whose orthography does not use spaces between words.

  • Stoplisting is performed based on the frequency of terms in the corpus: the most frequent words (together accounting for 50% of the tokens in the workset) and the least frequent words (together accounting for 10% of the tokens in the workset) are removed (see the sketch after this list for one reading of this rule).
  • Topic models are created based on the parameters set for the job.
    • Iterations: A lower number of iterations (e.g. 200) will process faster and is good for experimentation. A higher number (e.g. 1000) will give publication-ready results.
    • Topics: You set the number of topics to be created, and multiple values can be entered for one job. For example, entering "20 40 60 80" as the number of topics trains separate models with 20, 40, 60, and 80 topics.
  • Generates a bubble visualization that shows how topics across models cluster together.
  • More documentation of the Topic Explorer is available at https://inpho.github.io/topic-explorer/.
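
As an illustration of the frequency-based stoplisting described above, here is a minimal Python sketch of one plausible reading of the rule: the most frequent words whose counts together reach 50% of all tokens, and the least frequent words whose counts together reach 10%, are added to the stoplist. This is only a sketch of the idea, not the InPhO implementation.

    from collections import Counter

    def frequency_stoplist(tokens, high_frac=0.5, low_frac=0.1):
        """Sketch of corpus-frequency stoplisting: remove the most frequent
        words accounting for high_frac of all tokens and the least frequent
        words accounting for low_frac of all tokens."""
        counts = Counter(tokens)
        total = sum(counts.values())
        stoplist = set()

        # Most frequent words, until their cumulative count reaches high_frac of the corpus.
        running = 0
        for word, count in counts.most_common():
            if running >= high_frac * total:
                break
            stoplist.add(word)
            running += count

        # Least frequent words, until their cumulative count reaches low_frac of the corpus.
        running = 0
        for word, count in sorted(counts.items(), key=lambda kv: kv[1]):
            if running >= low_frac * total:
                break
            stoplist.add(word)
            running += count

        return stoplist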




  • View the bubble visualization
    • This enables you to see the granularity of the different models and how terms may be grouped together into "larger" topics.
    • If you trained multiple models, you can toggle the views of the corresponding bubbles for each model.
    • You have the option to turn collision detection on and off.
  • View the topic file (JSON format) to see the topics and the corresponding probability for each word.
  • Download the topics.json file and do further analysis and visualization locally (see the sketch after this list).
  • Download the workset.tez file to access a more robust visualization of your topics. You'll follow the instructions to install the InPho Topic Explorer locally, and then import your workset.tez file to view a more interactive visualization.
    • You'll use the "topicexplorer import" and "topicexplorer launch" commands, since the models were already trained in the HTRC Analytics interface.
    • NOTE: We recommend renaming your workset.tez file once you download it, in order to disambiguate the results of multiple jobs you may want to load into the Topic Explorer.
    • You can watch a video of a locally-running Topic Explorer on YouTube.
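
Because the exact layout of topics.json is not documented here, the short Python sketch below simply loads the downloaded file and prints a preview of its top-level structure so you can orient yourself before building your own analysis or visualization. The file name assumes you kept the default topics.json; adjust the path if you renamed it.

    import json

    # Load the topics file downloaded from your job results.
    with open("topics.json", encoding="utf-8") as f:
        topics = json.load(f)

    # Preview the top-level structure: keys for a JSON object,
    # or the length and first element type for a JSON array.
    if isinstance(topics, dict):
        for key, value in list(topics.items())[:10]:
            print(key, "->", type(value).__name__)
    else:
        print("list of", len(topics), "items")
        if topics:
            print("first item type:", type(topics[0]).__name__)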
Author: Jaimie Murdoch. Version: 1.0

Token Count and Tag Cloud Creator

Identify the tokens (words) that occur most often in a workset and the number of times they occur. Create a tag cloud visualization of the most frequently occurring words in a workset, where the size of the word is displayed in proportion to the number of times it occurred.

Result of job: a tag cloud showing the most frequently occurring words, and a file (token_counts.csv) with a list of those words and the number of times they occur.

Volume limit: 3000 volumes or 3GB

  • Prepares the data by identifying the page header/body/footer and extracting only the page body for analysis. Combines end-of-line hyphenated words in order to de-hyphenate the text.

  • Removes stop words as specified by the user

    • We provide access to a default list, or you can provide a link to your own list. To do so, create a stopword list and put it somewhere with a web-accessible URL. Add the URL to the appropriate field when setting parameters for your job.
    • If you leave this field blank, no stopwords will be removed.
  • Applies replacement rules (i.e. corrections) as specified by the user, maintaining the original case of the replaced words

    • We provide access to a default list, or you can provide a link to your own list. To do so, create a CSV file where the first column is the word to be replaced and the second is the replacement. The file must have a header row. Put the file somewhere with a web-accessible URL. Add the URL to the appropriate field when setting parameters for your job.
    • If you leave this field blank, no replacements will be made.
  • Tokenizes the text using the Stanford NLP model for the language specified by the user, or does white-space tokenization
    • Enter the 2-letter code for the most prominent language in your workset.
    • If the language is not English (en), French (fr), Arabic (ar), Chinese (zh), German (de), or Spanish (es), then your text will be tokenized using whitespace.

  • Regular expression pattern matching is used to control what appears in the tag cloud.
    • We provide a base regular expression that limits the display to words made up of letters or containing a hyphen.
    • This parameter does not affect what words will appear in the token count file.
  • Tokens are counted for the entire workset, and then sorted in descending order for the resulting token count file.

  • The top tokens are displayed in a tag cloud.

    • You choose how many tokens to display in your visualization.
  • View the tag cloud visualization
    • The most frequently used words in the workset are displayed larger, while less frequently used words are displayed smaller
  • View the token count list (CSV format)
  • Download the token_counts.csv file
    • Compare the token counts for one workset to the counts of another workset (see the sketch after this list)
    • Create your own visualization
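
For example, once you have downloaded token_counts.csv from two different jobs, a short Python sketch like the one below can compare the two worksets. The column layout is an assumption (a token column followed by a count column) and the file names are hypothetical; check the header row of your files and adjust accordingly.

    import csv

    def load_counts(path):
        """Read a token_counts.csv file into a dict of token -> count.
        Assumes the first column is the token and the second is its count."""
        counts = {}
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.reader(f):
                if len(row) >= 2:
                    try:
                        counts[row[0]] = int(row[1])
                    except ValueError:
                        continue  # skip the header row or malformed lines
        return counts

    a = load_counts("workset_a_token_counts.csv")  # hypothetical file names
    b = load_counts("workset_b_token_counts.csv")

    # Tokens that appear in workset A but not in workset B, most frequent first.
    only_in_a = sorted((t for t in a if t not in b), key=a.get, reverse=True)
    print(only_in_a[:20])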
Author: Boris Capitanu. Version: 2

Named Entity Recognizer

Generate a list of all of the names of people and places, as well as dates, times, percentages, and monetary terms, found in a workset. You can choose which entities you would like to extract.

Result of job: a table of the named entities found in a workset (entities.csv)

Volume limit: 3000 volumes or 3GB

  • Prepares the data by identifying the page header/body/footer and extracting only the page body for analysis. Combines end-of-line hyphenated words in order to de-hyphenate the text.

  • Tokenizes the text using the Stanford NLP model for the language specified by the user

    • Enter the 2-letter code for the most prominent language in your workset.
    • This algorithm only supports English (en), French (fr), Arabic (ar), Chinese (zh), German (de), and Spanish (es). Entering the code for any other language will cause the job to fail.

  • Performs entity recognition/extraction using the Stanford Named Entity Recognizer, and then shuffles the entities found on each page in order to prevent aiding page reconstruction.

  • Your results are saved to a file.


  • View the named entities list (CSV format), which shows the volume ID where the entity was found, the page sequence on which the entity occurred, the entity, and the entity type (person, place, etc.)
  • Download entities.csv for further analysis locally (see the sketch after this list).
    • Compare the results of multiple worksets.
    • Create a visualization of the entities.
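
As a starting point for local analysis, the Python sketch below tallies the most frequently mentioned entities of one type from a downloaded entities.csv. The column names used here ("entity" and "type") and the label "PERSON" are assumptions; check the header row and values in your own file and adjust them to match.

    import csv
    from collections import Counter

    counts = Counter()
    with open("entities.csv", newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)  # uses the file's own header row for column names
        for row in reader:
            # "entity" and "type" are assumed column names; rename them to match
            # the actual headers in your entities.csv.
            if (row.get("type") or "").upper() == "PERSON":
                counts[row.get("entity") or ""] += 1

    # The twenty most frequently mentioned person entities in the workset.
    print(counts.most_common(20))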
Author: Boris Capitanu. Version: 2

Deprecated algorithms:

  • Naive-Bayes classification
  • MARC Downloader
  • Meandre Dunning Log-likelihood to Tagcloud
  • Simple Deployable Word Count
  • Meandre Topic Modeling
  • Meandre Tagcloud
  • Meandre Tagcloud with Cleaning
  • Meandre Spellcheck Report Per Volume
  • Meandre OpenNLP Entities List
  • Meandre OpenNLP Date Entities To Simile

 
