Notes of the user group meeting on Jan 23, 2014

Attendees: Michelle A. Paolillo (Cornell), Matt Wilkens (Notre Dame), Adam Chandler (Cornell), Ian Barba (Texas University Library)

HTRC: Loretta Auvil, Harriett Green, Sayan Bhattacharyya, Miao Chen

2 topics were discussed:

1) feedback on HTRC Bookworm http://sandbox.htrc.illinois.edu/bookworm/

2) NLP features important for your research

Integrating Bookworm with curated collections would be nice. It would also be nice to have Bookworm work against a workset.

Matt: on the NLP side of things, two things are useful to his research: POS tagging and named entity extraction (NER).
POS tagging is a relatively solved problem, with about 97% accuracy, and it only needs to be done once. There is no need to offer it interactively.
NER is also one-time processing, but it is nowhere near as solved as POS tagging and produces a lot of errors. Probably more importantly, users might be able to supply their own data. He wondered how many potential users out there would have that kind of use case.

Loretta: We have it in the tool set now. We have a list of things to add to Bookworm, and NER is on the list. We will see if it finally gets implemented.
POS tagging is well solved for modern corpora, but not for older ones. You will see more noise as you go back in time.
Headers/footers will cause problems, and she has that on the list for BW.

Matt: Ted Underwood has been working on header/footer removal. He is not sure it is easy to integrate with the current workflow.
For POS tagging of older texts, a text processing group at Northwestern University works on that, especially for older material, e.g. the Shakespeare era. He thinks their training data is openly available.

The quality of NER is mixed. Even for people/organizations/locations, there are a lot of errors. If it is made public, people will question whether it is good enough for their work.

Question: Can I use regular expressions to coalesce word forms, or use wildcards to capture divergent forms, etc.?
Answer: It's not offered in the current system, probably because of the way it is implemented in MySQL, but it should be doable.
Michelle and Loretta have the impression that the Bookworm creators are also involved with the Google n-gram viewer. This suggests that Bookworm can be tweaked to do things similar to the Google n-gram viewer.
Loretta: we can change the back end to Solr, to scale up to 3M/10M books.
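
To make the regular-expression question concrete, here is a minimal sketch of what coalescing word forms could mean. It is not how Bookworm works: the function names and the sample counts are made up for illustration, and it assumes the word counts are already available as a simple word-to-count table.

```python
import fnmatch
import re
from collections import Counter

def coalesce_regex(counts, pattern):
    """Sum the counts of all words that fully match a regular expression."""
    regex = re.compile(pattern, re.IGNORECASE)
    return sum(n for word, n in counts.items() if regex.fullmatch(word))

def coalesce_wildcard(counts, pattern):
    """Sum the counts of all words that match a shell-style wildcard."""
    return sum(n for word, n in counts.items()
               if fnmatch.fnmatchcase(word, pattern))

# Hypothetical word counts for one year (not real HTRC data).
counts = Counter({"color": 12, "colour": 30, "colours": 5,
                  "colored": 3, "collar": 7})

regex_total = coalesce_regex(counts, r"colou?r(s|ed)?")
wildcard_total = coalesce_wildcard(counts, "colo*r*")
print(regex_total, wildcard_total)
```

Both calls group the "color"/"colour" forms into one number and leave "collar" out.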

Question: for metadata such as author/gender, publication, state, how complete is that metadata?

Loretta explained how the gender metadata was derived: it comes from Stacy Kowalczyk and Peng Zong's work, presented at the last UnCamp.

Sayan: maybe we need something to distinguish between metadata that was determined by an algorithm and metadata that already existed.
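
One way to picture Sayan's suggestion is to store a source label next to each metadata value. This is only a sketch: the class name, the field names and the "catalog"/"inferred" labels are hypothetical, not part of Bookworm or the HTRC metadata.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MetadataValue:
    value: str
    source: str  # hypothetical labels: "catalog" or "inferred"

# Made-up record for illustration only.
record = {
    "author": MetadataValue("Austen, Jane", "catalog"),
    "author_gender": MetadataValue("female", "inferred"),
}

# List the fields whose values were decided by an algorithm.
inferred_fields = sorted(k for k, v in record.items() if v.source == "inferred")
print(inferred_fields)
```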

Loretta: are there any facets that people may come up with in the future? We would like to have the ability to extend it.

Miao: right now the x-axis is time. What else could be the x-axis?

Michelle: one thing you can do is segment an individual text. To look at word frequency within a book, divide it into a certain number of segments and show the frequency of words across the segments, in the order of the book's content.
Miao mentioned similar work done by Wolfram Alpha.
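
Michelle's idea of using position in the book as the x-axis can be sketched roughly as follows. The function name, the simple tokenizer and the sample text are assumptions for illustration. This is not an existing Bookworm feature.

```python
import re

def segment_frequencies(text, word, n_segments):
    """Relative frequency of `word` in each of `n_segments` equal slices."""
    if n_segments <= 0:
        raise ValueError("n_segments must be positive")
    tokens = re.findall(r"[a-z']+", text.lower())
    # Ceiling division so that every token falls into some segment.
    size = max(1, -(-len(tokens) // n_segments))
    freqs = []
    for i in range(n_segments):
        segment = tokens[i * size:(i + 1) * size]
        freqs.append(segment.count(word) / len(segment) if segment else 0.0)
    return freqs

# Toy "book" of eight words, split into four segments of two words each.
book = "whale whale sea sea ship ship whale sea"
freqs = segment_frequencies(book, "whale", 4)
print(freqs)
```

Each value in the result would be one point on the plot, with the segment number in place of the year.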

Loretta: we can also slice time a bit differently.

Sayan asked Michelle: would you want to provide a single year and see the results for that particular year, rather than for multiple years?

Loretta: the latest version of Bookworm has the timeline better laid out, and we will install it.

Sayan: another possibility is to fix a particular place of publication, e.g. New York City, and look at language use at different locations, from east to west: different words, language variation, perhaps with a distance measure. That is, fix a particular place on the map and do the visualization from there.

Miao mentioned a problem with female vs. male writers for the word "revolution": at a given time point, the frequency of "revolution" for female writers plus the frequency for male writers should be less than or equal to the overall frequency of "revolution", but that is not what HTRC Bookworm shows.
Loretta: it's because of a smoothing issue. You can turn off the smoothing option. You can also use the exact count option instead of the frequency percentage.
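
Loretta's suggestion to use exact counts can be illustrated with made-up numbers. This sketch assumes each group's percentage is computed against that group's own word total. That is an assumption about the display, separate from the smoothing issue mentioned above, and the numbers are not real HTRC data.

```python
# Made-up counts for one year.
groups = {
    "female": {"hits": 30, "total_words": 1_000},
    "male": {"hits": 50, "total_words": 4_000},
}

all_hits = sum(g["hits"] for g in groups.values())
all_words = sum(g["total_words"] for g in groups.values())

# Percentage of each group's own words that are "revolution".
pct = {name: 100 * g["hits"] / g["total_words"] for name, g in groups.items()}
overall_pct = 100 * all_hits / all_words

print("per-group %:", pct, "overall %:", overall_pct)
print("sum of group % exceeds overall %:", sum(pct.values()) > overall_pct)
print("group counts add up to the total:", all_hits)
```

Under that assumption the exact counts add up across groups, while the percentages are not expected to.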

Adam: just started with HTRC. If you have the resources to set up meetings in WebEx etc. for screen sharing, that would help explain things.

Loretta: we will update to the newest version of the Bookworm UI, and probably won't make changes to the backend code unless something is incorrect in the metadata.

When things change, we need to regenerate everything from the beginning. That is why we need to improve scalability, so that we can adapt to metadata changes, e.g. if some gender value was wrong.

**************

Michelle Paolillo shared her feedback on HTRC Bookworm on the user group list. We also went over the list. The items are listed in her order of priority.

I would like to search against a collection that I make – is this possible now? Planned? Is there a roadmap for this feature?
Can I use regular expressions to coalesce word forms, or use wildcards to capture divergent forms, etc.?
I like the faceting of metadata – but can we select multiple options within each facet? For instance, can I search for a term in publications from both French and English? Published in both the US and Canada?
Is there a way to integrate a “cleaning” option such that counts within page headers and footers can be excluded (to avoid overcounting)?
Can we download the data in tabular or CSV form? (Not just the picture of the data?)
I would like to enter year exactly. (The slider doesn’t seem that accurate.)
I really like the links to the texts. Makes this a discovery/exploratory tool as well as an analytical tool.