2019-2020 ACS awardees are sharing updates about their work to date roughly midway through their project cycles. Check them out below!
A Half-Century of Illustrated Pages: ACS Lab Notes —
ACS awardee: Stephen Krewson (Yale University)
We’ve reached the mid-point of my Advanced Collaborative Support project, “Deriving Basic Illustration Metadata.” Right now, sitting on a supercomputer named Big Red at Indiana University, is a rather remarkable dataset: every illustrated page from every Google-scanned volume in the HathiTrust Digital Library for the period 1800-1850. Although the image processing pipeline we are using is not new, working at this scale is.
HTRC & Me —
ACS awardee: Dan Sinykin (Emory University)
In fall 2018, the study of contemporary literature changed. HathiTrust opened its vault of copyrighted material to computational scholarly use. Before then, a scholar wanting to study contemporary literature at scale needed to collect the rare publicly available texts and purchase the rest privately. The result was an expensive boutique corpus (text dataset) that could not be shared, whether for collective use or so that others could test whether one's work replicates. These were severe limits. Now scholars can access hundreds of thousands of contemporary texts.
GlobalNames and the HathiTrust —
ACS awardees: Dmitry Mozzherin and Matt Yoder (Species File Group, Illinois Natural History Survey, Prairie Research Institute, University of Illinois)
Our team of researchers, the Species File Group, develops and uses digital tools for biodiversity informaticians, those scientists who study the Earth's species. One of the things we focus on is locating information about the Earth's species via their scientific names, a project called GlobalNames. The idea is straightforward: find a biological name like _Homo sapiens_ (humans), _Apis mellifera_ (the Western honey bee), or _Anopheles gambiae_ (a mosquito that transmits malaria), and you may discover information important to scientists "nearby". In the context of the GlobalNames project, finding a name means parsing digitized literature or datasets, small or large. Thanks to funding from the National Science Foundation (NSF ABI 1645959, 2015), the initial tools created by Dmitry Mozzherin and Alex Myltsev were developed and hardened against the large, free corpus of scientific publications in the Biodiversity Heritage Library (BHL). Within the BHL, the diversity of the data (e.g. different languages, publication types, general quality of parsed text) and its structure let us find and resolve many edge cases in the name-detection algorithms. While finding specially formatted latinized names is challenging, the results of this work are fairly simple: at their core, they are an index indicating that “_this_ name was found _there_”. From these simple data many downstream features and explorations emerge. For example, the list of names found on any given page of the BHL (e.g. Scientific Names on this Page) is derived from our tools.
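To make the "_this_ name was found _there_" idea concrete, here is a minimal sketch of such an index in Python. It is not the GlobalNames algorithm: the regular expression, the `build_name_index` function, the `known_genera` allow-list, and the sample pages are all assumptions made up for illustration, and a real name finder has to handle abbreviations, OCR errors, authorship strings, and many other edge cases that this toy ignores.

```python
import re
from collections import defaultdict

# Toy pattern (an assumption, not the real detector): a capitalized genus
# followed by a lowercase species epithet, e.g. "Apis mellifera".
BINOMIAL = re.compile(r"\b([A-Z][a-z]{2,})\s+([a-z]{3,})\b")


def build_name_index(pages, known_genera):
    """Map each detected name to the (page_id, character_offset) pairs where it occurs.

    `pages` is a dict of page_id -> page text. `known_genera` is a hypothetical
    allow-list used to discard ordinary capitalized phrases such as "Humans belong".
    """
    index = defaultdict(list)
    for page_id, text in pages.items():
        for match in BINOMIAL.finditer(text):
            genus, epithet = match.groups()
            if genus in known_genera:
                index[f"{genus} {epithet}"].append((page_id, match.start()))
    return dict(index)


# Made-up sample pages, for illustration only.
pages = {
    "p1": "Humans, Homo sapiens, keep the honey bee Apis mellifera.",
    "p2": "Apis mellifera pollinates many crops.",
}
index = build_name_index(pages, known_genera={"Homo", "Apis"})

for name, locations in index.items():
    print(name, locations)
```

Inverting the same dictionary (page first, then names) would give a per-page list of names, which is the shape of a feature like the per-page name listing mentioned above.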