Notes of the user group meeting on Aug 20, 2014

Attendees: Charlotte Cubbage, Brendan Quinn, Claire Stewart, Geoff Swindells, Phil Burns, Bill Parod, Chris Comerford (all from Northwestern University)

Matt Wilkens (University of Notre Dame), Susette Newberry (Cornell University), Virginia Cole (Cornell University), Rachel Brekhus (University of Missouri)

HTRC team: Sayan Bhattacharyya, Loretta Auvil, Miao Chen

Meeting topic: This time we invited Claire Stewart and her colleagues from the Northwestern University library to share their experience and thoughts on connections between HTRC and library services. The theme of the meeting was digital humanities in libraries and HTRC's role in library services.

Audience of HTRC

Not a lot of faculty know what HT or HTRC are. They have a DH program and DH activities. We could use a shared syllabus or some kind of workshop to introduce HTRC and the algorithms in the HTRC portal. It is not clear how many faculty are interested in large-scale data mining.

There has been a dramatic increase in interest in digital humanities, digital publishing and text-analysis type of projects. This is mostly from junior faculty and graduate students, but of late there has also been a surge of interest among advanced undergraduates.

Video tutorials that HTRC is developing will be useful for users. There is an "audience question": many find the HTRC tools to be fairly limiting, especially in comparison with commercial tools.

We don't know if the corpus tools fit the current interests of our linguistics faculty. Who the general audience is remains an open question.

HTRC updates

Currently the HT has 3 million volumes of non-copyrighted materials and 8 million volumes of copyrighted materials.

HT-Bookworm (under development by HTRC and groups from Baylor and Northeastern) and the HTRC Data Capsule (which will allow users to run computational methods against protected data) are two of the tools being developed to make copyrighted material more usable in the future. For HT-Bookworm to work, the back-end will need to be changed from MySQL (as it is currently) to Solr, which is likely to be more scalable. The plan is also to integrate HT-Bookworm with HTRC worksets, so that one would be able to go in both directions: from the HTRC Data Capsule to the workset, and from the workset to the HTRC Data Capsule. That is, from what the user discovers using HT-Bookworm, the user might be able to automatically generate a workset. The goal is to make HT-Bookworm work with all the public-domain material in HTRC within a year. If HTRC has obtained the copyrighted data by then, the plan is to integrate that as well.

Tutorials

There will be more interest when there is more tutorial information. The more things they can point people to, the better.

Also, the things that faculty want to do don't always fit into existing tutorial information. Doing something more visual with topic models, for example with Termite from Stanford, may be good.

Currently, some people at the UIUC Scholarly Commons are developing videos and storyboards to help people understand HTRC functionality.

Algorithms and Text, inside or outside HTRC

If someone wants to submit an algorithm to the portal, how is the decision made? Currently, it is decided on a case-by-case basis. Setting up a workflow or process for people to submit their algorithms would be useful.

What happens if someone wants to integrate textual content of their own with HTRC materials? Unfortunately, taking in data from other sources is not currently on the radar. Augmenting an analysis with additional data is on the radar, for example validating results against a dictionary. Augmenting the data itself with an additional corpus has not been on the agenda.

Can we use the HTRC suite of algorithms on non-HTRC data? Yes, you can obtain the tools from other sources (e.g. the Mallet tools) and run them on your own data.

It was proposed that scholars could run HTRC algorithms on cleaned text from local faculty, and that the local text could then be made available in HTRC so that everybody can use it.

Possible use types

When faculty come and talk with librarians about analysis of digitized text, there are three main types of use cases:

1) They bring their own text

2) They bring a set of texts for which they would like to use HTRC as a reference corpus

3) They bring specific texts that they want to compare with texts that are in the HT corpus

Suggestions

No matter how big the HT corpus is, or how powerful the existing algorithms are, if the existing corpus and algorithms do not meet the needs of the specific problem that the user is trying to solve, then the user will not use the resources. Often, people come to librarians with their own texts (such as EEBO, ECHO, the Old Bailey text corpus, etc.), describe the problem, and request that an algorithm be written to do the analysis they have in mind. Nowadays, even undergraduates come with quite complex tasks.

Every project is different and so needs its own specific support. We need a path or mechanism for addressing a researcher's request for help with algorithms or data.

Sometimes the text data set that the user is interested in consists of government documents that are released periodically, for example all the articles put out within a certain time period.

Some suggestions of what the HTRC could do:

1) Try to create a bibliography of conference papers to keep track of what researchers are doing with HTRC tools and resources. This is similar to what the ICPSR does, which has been mentioned and used in sociology classes. If HTRC can become such a source, there could be more use. People need to be more aware of HTRC, and they need to be encouraged to use it.

2) People need to be encouraged to use HTRC as the source for the documents they need. Try to find a way to communicate with undergraduates directly, and let them know about HTRC and what it can do. Often, undergraduates will run with the resource and convince faculty to let them use it for class projects, etc. Faculty members tend to be more hesitant about using new resources than undergraduates are.

3) The crux of the matter is to come up with tutorials that people can try out whenever they have a few minutes free, and that still offer a useful learning experience. People's time is fragmented in odd ways, especially students' time. Sometimes they may have only a few minutes to spend on a tutorial.

4) Try to create mechanisms that let people use complex tools in an easy way. Tutorials are a great way to achieve this.