Glossary

This glossary contains definitions and explanations for certain key terms and tools used in the Digging Deeper, Reaching Further (DDRF) Workshops. 

Algorithm

A process a computer follows to solve a problem, creating an output from a provided input.

AntConc

A freeware corpus analysis toolkit for text analysis, especially for analyzing concordances. See more at: http://www.laurenceanthony.net/software/antconc/ 

API (Application Programming Interface)

A set of clearly defined communication methods (which may include commands, functions, protocols, and objects) that can be used to interact with an external system. In practice, an API is a set of instructions, written in code, for accessing systems or collections.

ArcGIS Online/StoryMaps

A visualization tool that can be used to incorporate GIS information and maps into interactive timelines and stories. See more at: https://storymaps.arcgis.com/en/

Bag-of-words model

A model for working with text in which grammar and word order are discarded and only the words and their frequencies are kept, as if the words had been mixed up in a bag.
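
As a minimal sketch of the bag-of-words idea, the following Python snippet reduces a sentence to word counts using only the standard library. The function name `bag_of_words` and the simple lowercase, split-on-whitespace approach are assumptions made for illustration and are not taken from the workshop materials.

```python
from collections import Counter


def bag_of_words(text):
    """Return a Counter mapping each lowercased word to its frequency."""
    # Splitting on whitespace is a deliberate simplification; real
    # projects usually tokenize more carefully.
    return Counter(text.lower().split())


# Two sentences with different word order produce the same bag,
# because only the words and their counts are kept.
first = bag_of_words("the dog chased the cat")
second = bag_of_words("the cat chased the dog")
print(first == second)
```

Because word order is thrown away, the two example sentences become indistinguishable, which is exactly the trade-off this model makes.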

Beautiful Soup

A Python-based web scraping tool that pulls data out of HTML and XML files. It has several options for specifying what you want to scrape (within the HTML) and is good for getting clean, well-structured text.

Chunking text

The process of splitting text into smaller pieces before analysis. May be divided by paragraph, chapter, or a chosen number of words.
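
One way to chunk by a chosen number of words can be sketched in Python as follows. The function name `chunk_words` and the whitespace-based word splitting are illustrative assumptions; chunking by paragraph or chapter would need a different splitting rule.

```python
def chunk_words(text, size):
    """Split text into a list of chunks of at most `size` words each."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    words = text.split()
    # Step through the word list `size` words at a time; the final
    # chunk may be shorter than the others.
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


chunks = chunk_words("one two three four five", 2)
print(chunks)
```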

Command line

A text-based interface that takes in commands and passes them to the computer's operating system. Commands can be used to accomplish (and script) a wide range of tasks. The interface is often called a shell, such as the Bash shell.

D3.js

A JavaScript library for web-publishable visualizations.

Data visualization

The process of converting data sources into a visual representation. It often also refers to the product of this process.

DH Press

A digital humanities toolkit that enables users to mash up and visualize a variety of digitized humanities-related material, including historical maps, images, manuscripts, and multimedia content. It can be used to create a range of digital projects and is designed for non-technical users. See more at: http://dhpress.org

Distant reading

As compared to close reading, which finds meaning in word-by-word careful reading and analysis of a single work (or a group of works), distant reading takes large amounts of literature and understands them quantitatively via features of the text. (Conceptualized by Franco Moretti)

Exploratory data analysis

An approach for familiarizing oneself with a dataset before analyzing it that often involves visualizations, including visualizations of raw counts and simple statistics, or comparative visualizations.

File Transfer Protocol (FTP)

A protocol that computers on a network use to transfer files to and from each other. A protocol is a set of rules that networked computers use to talk to one another, like a language.

Functions

Reusable code blocks that perform an action.

Gephi

A free visualization and exploration software that can be used to create graphs and networks. It works especially well for exploratory data analysis. See more at: https://gephi.org

ggplot

Python library for data visualization.

ggplot2

R library for data visualization.

Google Books Ngram Viewer

A tool that enables users to search for words in corpora of texts and visualize their usage over time. See more at: https://books.google.com/ngrams

Grouping text

The process of combining text into larger pieces before analysis.

HathiTrust 

A library consortium founded in 2008. HathiTrust is a community of research libraries committed to the long-term curation and availability of the cultural record.

HathiTrust Digital Library (HTDL)

A digital preservation repository and highly functional access platform under HathiTrust. It provides long-term preservation and access services for public domain and in-copyright content from a variety of sources, including Google, the Internet Archive, Microsoft, and in-house partner institution initiatives. Overall, the content mostly consists of digitized books from libraries.

HathiTrust Research Center (HTRC)

A research center under HathiTrust that facilitates computational, scholarly research using the 16+ million volumes in the HathiTrust Digital Library. The HTRC provides mechanisms for non-consumptive access to content in the HathiTrust corpus, as well as tools for computational text analysis.

HT Collection Builder

An interface for creating collections via the HathiTrust Digital Library.

HTRC algorithms

A set of off-the-shelf text analysis algorithms provided via HTRC Analytics for users to analyze their worksets, ranging from simple word count to sophisticated ones such as topic modeling.

HathiTrust + Bookworm 

A tool that visualizes word frequencies over time in the HathiTrust Digital Library. See more at: https://bookworm.htrc.illinois.edu/develop

HTRC Extracted Features

A downloadable dataset of text data and metadata extracted and abstracted from volumes in the HathiTrust Digital Library.

HTRC Feature Reader

Python library for working with HTRC Extracted Features.

Humanities data 

In a humanities research setting, “data” can be defined as material generated or collected while conducting research. Humanities data may include databases, citations, software code, algorithms, documents, etc. (Adapted from definition provided in Data Management Plans for NEH Office of Digital Humanities Proposals and Awards)

Job (in HTRC context)

An algorithm run against a workset in HTRC Analytics.

Lexos

A web-based tool that can be used for pre-processing, analysis, and visualization of digitized texts. Lexos can also be downloaded and installed locally. See more at: http://lexos.wheatoncollege.edu/upload

Libraries/packages

Collections of functions that can be implemented in a script or program.

Machine learning

A process that gives computers the ability to learn without being explicitly programmed. Machine learning is based on researchers constructing and using algorithms that can learn from and make predictions on data. It can be either unsupervised (learning patterns from unlabeled data, with minimal human intervention) or supervised (learning from examples that people have labeled, with more human intervention).

N-grams

A contiguous chain of n items from a sequence of text, where n is the number of items. Unigrams are one-item chains, bigrams are two-item chains, and so on.
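
The definition can be illustrated with a short Python sketch that slides a window of n items across a list of tokens. The function name `ngrams` and the choice of whole words as the items are assumptions for this example; the items could just as well be characters.

```python
def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # A sequence of length L contains L - n + 1 windows of size n
    # (and none at all when n is larger than L).
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "to be or not to be".split()
print(ngrams(tokens, 2))  # the bigrams of the phrase
```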

Naïve Bayes classification

A method based on Bayes’ Theorem from statistics that uses machine learning to classify texts based on information present in the texts of each class.

Named entity extraction

Using computers to locate and classify named entities (such as the names of persons, organizations, and locations) in text.

Natural Language Processing (NLP)

Using computers to understand the meaning, relationships, and semantics within human-language text.

Node-link diagram

A type of visualization for displaying networks. It captures entities (such as people, places, and topics) as nodes (also called “vertices”) and relationships as links (also called “edges”), with a circle or dot representing a node, and a line representing a link. 

NodeXL

An add-in for Microsoft Excel that supports social network and content analysis. Available in Basic and Pro versions. See more at: http://www.smrfoundation.org/nodexl/

Non-consumptive research

Research in which computational analysis is performed on text, but in which the researcher does not read or display substantial portions of the text to understand its expressive content.

OpenRefine

A tool like Microsoft Excel that is powerful for exploring, cleaning, and manipulating tabular data. Originally known as Freebase Gridworks and later as Google Refine, OpenRefine became an open community resource in 2012. See more at: http://openrefine.org

Optical character recognition (OCR)

Mechanical or electronic conversion of images of typed, handwritten or printed text into machine-encoded text. The quality of the results of OCR can vary greatly, and raw, uncorrected OCR is often “dirty”, while corrected OCR is referred to as “clean”.

Package Manager

A tool that facilitates the download and installation of programming packages.

pip

The standard package manager for Python. Conda and Homebrew are other package managers that can also install Python software.

pyplot

A plotting module in Matplotlib, a Python visualization library. The plotting methods in the Python data science package Pandas use Matplotlib by default.

Python

A programming language that is commonly used when working with data. Python has high-level data structures, is an interpreted language, and has a relatively simple syntax.

PythonAnywhere

A browser-based programming environment that is also a code editor and file hosting service. It comes with a built-in Bash shell and does not interact with your local file system.

R

A programming language optimized for (statistical) data analysis.

Results (in HTRC context)

The output that HTRC algorithms produce for your job(s). You can view or download them.

rsync

A fast file-copying tool widely used for backups. It is known for its efficiency: it reduces the amount of data sent over the network by sending only the differences between the files at the source location and the files at the destination location.

Script

A file containing a set of programming statements that can be run using the command line. Python scripts are saved as files ending with the extension “.py”.

Secure/SSH File Transfer Protocol (SFTP)

Works in a way similar to FTP, but is a separate protocol that encrypts the connection to enable a secure file transfer.

Sentiment analysis

Using computers to systematically identify attitudes or emotions present in text.

Stop words

Frequently used words (such as “the”, “and”, “if”) that are often removed from text before performing analysis.
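
A minimal Python sketch of stop word removal follows. The tiny `STOP_WORDS` set and the function name `remove_stop_words` are made up for illustration; real stop lists are much longer and are usually supplied by a text analysis library or chosen for the project at hand.

```python
# A tiny illustrative stop list (an assumption for this example).
STOP_WORDS = {"the", "and", "if", "a", "of"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens whose lowercased form is not in the stop list."""
    return [token for token in tokens if token.lower() not in stop_words]


print(remove_stop_words("The cat and the hat".split()))
```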

Stylometry

The application of the study of linguistic style. It is often used to attribute authorship of anonymous or disputed texts.

Tableau

A set of software that can be used for data preparation, visualization, and analysis. Among the different versions of Tableau Desktop (geared towards individual usage), Tableau Public is available for free. See more at: https://public.tableau.com/s/ and https://www.tableau.com

Text analysis

A form of data mining, using computer-aided methods to study textual data.

Text corpus/corpora

A “corpus” of text can refer to both a digital collection and an individual's research text dataset. Text corpora, the plural form, are bodies of textual data.

Tokenization

Breaking text into pieces called tokens. Often certain characters, such as punctuation marks, are discarded in the process.
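
As a rough sketch, tokenization into lowercase words can be done in Python with a regular expression. The function name `tokenize` and the pattern used here are assumptions for illustration; the pattern keeps only ASCII letters, digits, and apostrophes, so accented characters would need a broader pattern.

```python
import re


def tokenize(text):
    """Lowercase text and return runs of ASCII letters, digits, and apostrophes."""
    # Everything else, including most punctuation, acts as a separator
    # and is discarded.
    return re.findall(r"[a-z0-9']+", text.lower())


print(tokenize("Call me Ishmael. Some years ago..."))
```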

Topic modeling

A method of using statistical models for discovering the abstract "topics" that occur in a collection of documents. 

Volume

Generally, a digitized book, periodical, or government document.

Voyant

A tool that can create many types of visualizations such as word clouds, bubble charts, networks, and word trees. It has a user-friendly interface that works great as a learning tool. See more at: http://voyant-tools.org/

Web scraping

The process of extracting data from webpages.

Weka

A collection of machine learning algorithms for data mining tasks. It contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. See more at: http://www.cs.waikato.ac.nz/ml/weka/

wget

A command line tool for retrieving files from a server. It can scrape the contents of a website, with options that can be modified to tailor more specifically to how you want the contents to be retrieved.

Word cloud/tag cloud

A graphical representation of word frequency, usually presenting words that appear more frequently in the source text larger than those that appear less frequently.

Word tree

A type of visualization that displays the different contexts in which a word or phrase appears in a text, with the contexts arranged in a tree-like structure to reveal recurrent themes and phrases. 

Wordle

A tool for creating word clouds, mostly for exploration and decorative purposes because not much fine-tuning can be done. See more at: http://www.wordle.net

Workset

In the HTRC environment, a workset is a sub-collection of HathiTrust content created by users.