...

A: You may sign up for an account by going to the HTRC Production Portal (http://htrc2.pti.indiana.edu) and choosing "Sign up" from the menu.

Q: How can I generate a list of volumes (such as N randomly selected volumes of non-fiction published in the nineteenth century)?

This will consist of two main steps:

Step 1:
You will need to come up with a list of volumeIDs (HathiTrust's ID strings for individual volumes in the HathiTrust collection) corresponding to those volumes whose full text you want. 
Step 2: 
You will request from HathiTrust a 'custom dataset' consisting of the full text of those volumes, by submitting the list of volumeIDs you generated in Step 1.

Operationalization of Step 1:

My guess is that your needs here will be best served by the "metadata core" of the HTRC Solr Proxy API:

http://www.hathitrust.org/htrc/solr-api

As the page linked above explains, the "metadata core" lets you run queries (through the API) that search by various metadata fields; the most important ones for your needs here are 'genre' and 'publishDate'. The latter can be used as a 'range' field. Following the range syntax shown in the doc, if you specify the following in your query to the API:

publishDate:[1800 TO 1900]
then this will return all volume IDs whose publishDate is between 1800 and 1900 (inclusive of 1800 and 1900).
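
A range query of this kind can be assembled as a URL before it is sent to the API. The following is only a minimal sketch: the base URL is a placeholder (take the real endpoint from the Solr Proxy API page), the field name "id" for the volume ID is an assumption, and "fl", "rows" and "wt" are standard Solr request parameters that the proxy is assumed, not confirmed, to pass through.

```python
from urllib.parse import urlencode

# Placeholder base URL (an assumption); use the endpoint given on the Solr Proxy API page.
SOLR_SELECT_URL = "http://example.org/solr/meta/select"


def build_range_query_url(start_year, end_year, rows=100, base_url=SOLR_SELECT_URL):
    """Return a Solr select URL for volumes published in [start_year, end_year]."""
    params = {
        # Square brackets make both ends of the range inclusive.
        "q": f"publishDate:[{start_year} TO {end_year}]",
        # Assumption: the volume ID is returned in a field named "id".
        "fl": "id",
        # Standard Solr parameters: number of results and response format.
        "rows": rows,
        "wt": "json",
    }
    # urlencode percent-escapes the colon and brackets and turns spaces into "+".
    return f"{base_url}?{urlencode(params)}"


url = build_range_query_url(1800, 1900)
print(url)
```

Building the URL with urlencode rather than by hand avoids mistakes with the spaces and brackets in the range expression. Further restrictions (place of publication, for instance) could be added to the "q" string in the same way, using whatever field names the API page documents.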

You can go by genre to decide whether a volume counts as non-fiction, but a problem here may be that 'genre' is often inaccurate or missing.

An alternative way to go about this, in case you are lucky enough to have data to tell you what you don't want (i.e. you simply don't want volumes that happen to be in a 'fiction' dataset), would be the following:

Rather than going by the 'genre' field at all, you can run a query on just the publication date range (plus anything else you need to restrict the search on, e.g. place of publication). This will give you a long list of volumeIDs of volumes that satisfy the specified criteria. You can then run a script to eliminate from that list those volumeIDs that also occur in your list of volumeIDs of fiction (I'm assuming that you have access to such a list). Finally, you can programmatically pick N volumeIDs at random from what's left.
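
The elimination and random-selection part could look roughly like the sketch below. The function name and the volumeIDs are made up for illustration, and the two lists are assumed to be already loaded into memory (for example, read from text files with one volumeID per line).

```python
import random


def sample_non_fiction(candidate_ids, fiction_ids, n, seed=None):
    """Drop IDs that appear in the fiction list, then pick n of the rest at random."""
    fiction = set(fiction_ids)
    # dict.fromkeys removes duplicates while keeping the original order,
    # so a fixed seed gives a repeatable sample.
    remaining = [v for v in dict.fromkeys(candidate_ids) if v not in fiction]
    if n > len(remaining):
        raise ValueError(f"asked for {n} volumes but only {len(remaining)} remain")
    # random.Random(seed).sample picks n distinct items without replacement.
    return random.Random(seed).sample(remaining, n)


# Made-up IDs for illustration only.
candidates = ["vol.001", "vol.002", "vol.003", "vol.004", "vol.005"]
fiction = ["vol.002", "vol.004"]
picked = sample_non_fiction(candidates, fiction, n=2, seed=0)
print(picked)
```

Passing a seed is optional; it is useful if you want to be able to regenerate exactly the same list of N volumeIDs later.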

So, once you've reached this point, you will have your list of volumeIDs ready.

Operationalization of Step 2:

Then, you can submit those volumeIDs to HathiTrust, requesting a "custom dataset" consisting of the content of just the volumes corresponding to those volumeIDs. (The section "Custom Datasets" on the page http://www.hathitrust.org/datasets spells out the procedure for making that request.)
This step is slightly bureaucratic because your list will almost invariably contain volumeIDs of volumes that were digitized by Google, which means that you must sign a couple of statements and submit them to HathiTrust before you can receive your "custom dataset".

At this point, you would be done.

Q: How do I access HTRC Production Stack?

...