1. HTRC Data API

The HTRC Data API is a RESTful web service allowing clients to retrieve digital text by providing volume IDs or page IDs.  The Data API is set up to require OAuth2 user authentication for security reasons.

1.1 OAuth2 Authentication

The Data API uses a simplified OAuth2 flow: a user who is already registered with our OAuth2 server sends a username and password to that server to retrieve a token.  The token is then added to the HTTP “Authorization” request header in subsequent calls to the Data API.  Because the username and password (and later the token) are sent as clear text in the request, both the OAuth2 server and the Data API require HTTPS rather than HTTP to prevent eavesdropping.

To request a token, the client sends an HTTP “POST” request to the OAuth2 server endpoint:

https://<oauth2-server>:<port>/oauth2/token?grant_type=client_credentials&client_id=<username>&client_secret=<password>

Make sure grant_type is set to “client_credentials”.  The request also requires the following HTTP request header:

content-type: application/x-www-form-urlencoded

as well as a request body that literally says

null

If the authentication is correct, the client receives a 200_OK status code, and the body of the response is a JSON string:

{"expires_in" : "3600", "access_token" : <authentication_token>}

The expires_in property gives the number of seconds for which the token is valid, and the access_token property is the token string that must be added to the HTTP request header for subsequent calls to the Data API.  There is one caveat about how the token is added to the header; see Section 1.2.1 for details.
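
The token exchange described above could be sketched in Python roughly as follows.  This is a minimal illustration using only the standard library, not an official client: the function names, the example host, and the credentials are hypothetical, and the request is built but never sent.

```python
import json
import urllib.parse
import urllib.request


def build_token_request(base_url, username, password):
    """Build (but do not send) the POST request for an OAuth2 token.

    base_url is a placeholder such as "https://<oauth2-server>:<port>".
    """
    query = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": username,
        "client_secret": password,
    })
    return urllib.request.Request(
        url=f"{base_url}/oauth2/token?{query}",
        data=b"null",  # the request body literally says "null"
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def parse_token_response(body):
    """Return (access_token, lifetime_in_seconds) from the JSON response body."""
    payload = json.loads(body)
    return payload["access_token"], int(payload["expires_in"])
```

To actually send the request, the result of build_token_request could be passed to urllib.request.urlopen and the response body handed to parse_token_response.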

1.2 Request to Data API

The Data API currently supports two resource types, volumes and pages, with an optional parameter to control how the texts are aggregated.  To get resources, the client sends an HTTP “POST” request to the Data API service.  The request must also have the following HTTP request header:

Content-type: application/x-www-form-urlencoded

1.2.1 Inserting the OAuth2 authentication token

The client adds an HTTP “Authorization” header to the HTTP request as follows:

Authorization: Bearer <authentication_token>

Be sure to prepend the string “Bearer ” (with a space at the end) to the authentication token before setting it as the value of the Authorization header.  The OAuth2 protocol requires it.

1.2.2 Requesting volumes

To request volumes, set the request URL to:

https://<server>:<port>/data-api/volumes

and the request body to a URL encoded "volumeIDs" parameter string:

volumeIDs=<volumeID_list>

<volumeID_list> := <volumeID> [“|” <volumeID> [“|”...]]

<volumeID> := <prefix>“.”<ID_string>


Example:

volumeIDs=inu.3011012%7Cuc2.ark%3A%2F13960%2Ft2qxv15

The volumeID list consists of one or more HathiTrust proprietary volumeIDs separated by the pipe character “|”.  Each volumeID consists of a prefix, which is used by HathiTrust to identify the originating institution of the volume, followed by a dot symbol “.”, followed by a string ID that is defined by each contributing institution.  The string IDs from different institutions vary greatly, ranging from fixed-length zero-padded numeric IDs to ARK identifiers containing filesystem-unsafe characters such as colons ":" and slashes "/".

For the purpose of making a request to the Data API, the client can simply view each volumeID as a string, but must perform necessary URL encoding on the volumeID list so it conforms to the application/x-www-form-urlencoded content-type.  For processing the responses from the Data API, the client may find it necessary to separate the prefix and perform some string manipulation on the volumeIDs (please see more details on this in subsection 1.2.3).

The optional parameter “concat” controls whether the pages of a volume should be aggregated, as discussed in section 1.2.3.  If omitted, the default value is false.
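
As an illustration of the volume request, the sketch below builds the POST request with the Authorization header and a URL-encoded body.  The function name and the example host are hypothetical, and sending “concat” in the request body alongside volumeIDs is an assumption, since the text above does not state where the parameter goes.

```python
import urllib.parse
import urllib.request


def build_volume_request(base_url, token, volume_ids, concat=False):
    """Build (but do not send) a POST request for one or more volumes.

    base_url is a placeholder such as "https://<server>:<port>".
    """
    fields = {"volumeIDs": "|".join(volume_ids)}
    if concat:
        # Assumption: concat travels in the body next to volumeIDs.
        fields["concat"] = "true"
    # urlencode percent-encodes "|", ":" and "/" as the content type requires.
    body = urllib.parse.urlencode(fields).encode("ascii")
    return urllib.request.Request(
        url=f"{base_url}/data-api/volumes",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )
```

With the two volumeIDs from the example above, the encoded body matches the example string exactly.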

1.2.3 Response from volume request

A correct request to the Data API receives a 200_OK status, and the body of the response is a binary ZIP stream, with the MIME type of “application/zip”.

If the request was sent with concat=false, the returned ZIP file would have the following structure:

volumes.zip
  <cleaned_volumeID_1>/
      00000001.txt
      00000002.txt
      …
      000nnnnn.txt
  <cleaned_volumeID_2>/
      00000001.txt
      00000002.txt
      …
      000xxxxx.txt
  …
  <cleaned_volumeID_n>/
      00000001.txt
      …
      0000zzzz.txt
  ERROR.err

In the ZIP file, each requested volume is in its own directory, and the name of the directory is a “cleaned” mutation of the original volumeID in a filesystem safe format.  Each page of the volume is a text file named by its page sequence as an eight-digit fixed-length zero-padded number.

<cleaned_volumeID> := <prefix>“.”<cleaned_ID_string>

A “cleaned” volumeID consists of the institution-identifying prefix, followed by the dot character ".", followed by the “cleaned” version of the original ID string.  This “cleaning” procedure is defined in the pairtree specification (https://wiki.ucop.edu/display/Curation/PairTree) so the directory name is filesystem safe.  However, it is worth pointing out that the prefix does not undergo the cleaning procedure.  This is because the prefix is not a part of the pairtree structure.
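
To match requested volumeIDs against the directory names in the ZIP file, a client could reproduce the cleaning step itself.  The sketch below follows our reading of the pairtree specification (hex-encode a small set of characters as “^xx”, then map “/” to “=”, “:” to “+”, and “.” to “,”); the function name is hypothetical, and the character rules should be verified against the specification before relying on them.

```python
# Characters that pairtree hex-encodes as "^" plus two hex digits.
_HEX_ENCODED = set('"*+,<=>?\\^|')
# Single-character substitutions applied to the remaining characters.
_SINGLE_CHAR = {"/": "=", ":": "+", ".": ","}


def clean_volume_id(volume_id):
    """Return the cleaned form of a volumeID; the prefix is left untouched."""
    prefix, _, id_string = volume_id.partition(".")
    cleaned = []
    for byte in id_string.encode("utf-8"):
        char = chr(byte)
        if byte < 0x21 or byte > 0x7E or char in _HEX_ENCODED:
            # Outside visible ASCII, or one of the reserved characters.
            cleaned.append(f"^{byte:02x}")
        else:
            cleaned.append(_SINGLE_CHAR.get(char, char))
    return f"{prefix}.{''.join(cleaned)}"
```

Under these rules, the ARK-based volumeID from the earlier example, uc2.ark:/13960/t2qxv15, becomes uc2.ark+=13960=t2qxv15, while inu.3011012 is unchanged.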

If an error occurs after the Data API has started sending the stream, the Data API injects a special ERROR.err file into the ZIP stream and then properly terminates the stream to prevent corruption.  ERROR.err contains information on the error.  If this file is present, the ZIP file may not contain all requested volumes/pages.   

On the other hand, if the request was sent with concat=true, the returned ZIP would have the following structure:

volumes.zip
   <cleaned_volumeID_1>.txt
   <cleaned_volumeID_2>.txt
   …
   <cleaned_volumeID_n>.txt
   ERROR.err

In this case, the pages belonging to each volume are no longer individual text files.  Instead, they are concatenated into a single text file, and the name of the text file is the cleaned volume ID with a .txt extension.

If the entry ERROR.err is present, there was an issue retrieving some resources, and the ZIP stream was terminated prematurely to prevent corruption.  Some requested resources may be missing.
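
A client might unpack a concat=false volume response along the following lines.  This is a sketch only: the function name is hypothetical, the entry names are assumed to be of the form <cleaned_volumeID>/<page>.txt as in the structure shown earlier, and the page text is assumed to be UTF-8.

```python
import io
import zipfile


def read_volume_zip(zip_bytes):
    """Group page texts by cleaned volumeID and page sequence.

    Returns (pages_by_volume, error), where error is the content of
    ERROR.err or None when that entry is absent.
    """
    pages_by_volume = {}
    error = None
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip explicit directory entries, if any
            text = archive.read(name).decode("utf-8")  # assumption: UTF-8
            if name == "ERROR.err":
                error = text
                continue
            directory, _, filename = name.rpartition("/")
            sequence = int(filename[:-len(".txt")])  # "00000001.txt" -> 1
            pages_by_volume.setdefault(directory, {})[sequence] = text
    return pages_by_volume, error
```

When error is not None, the caller should treat pages_by_volume as incomplete.  For a concat=true response the entries sit at the top level of the archive, so the directory handling here would need to be adapted.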

1.2.4 Requesting pages

Use the following URL to request specific pages from a book instead of retrieving the entire volume:

https://<server>:<port>/data-api/pages?pageIDs=<pageID_list>[&concat=true]

where

<pageID_list> := <pageID> [“|” <pageID> [“|”...]]
<pageID> := <volumeID>“[”<page_sequence_1> [“,”<page_sequence_2> [“,”...]]“]”

Example:
https://silvermaple.pti.indiana.edu:25443/data-api/pages?pageIDs=inu.3011012[1,2,20,30]|uc2.ark:/13960/t2qxv15[11,45,30,17,22,55]

A pageID list consists of one or more pageIDs separated by the pipe character “|”.  Each pageID is a volumeID followed by a comma-separated list of page sequence numbers enclosed in square brackets.  The client may find it necessary to perform URL encoding on the pageID list so all characters can be safely passed as query parameters.

The optional parameter “concat” controls how these pages should be returned.  If omitted, the default value for concat is false.
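
The pageID list syntax and its URL encoding can be illustrated with a short sketch.  The function names and the example host are hypothetical; the code only builds the URL string and does not contact any server.

```python
import urllib.parse


def format_page_ids(pages_by_volume):
    """Render {volumeID: [page sequences]} in the pageID list syntax."""
    return "|".join(
        f"{volume_id}[{','.join(str(seq) for seq in sequences)}]"
        for volume_id, sequences in pages_by_volume.items()
    )


def build_pages_url(base_url, pages_by_volume, concat=False):
    """Build the request URL for pages with a percent-encoded query string.

    base_url is a placeholder such as "https://<server>:<port>".
    """
    params = {"pageIDs": format_page_ids(pages_by_volume)}
    if concat:
        params["concat"] = "true"
    # urlencode percent-encodes "[", "]", ",", "|", ":" and "/".
    return f"{base_url}/data-api/pages?{urllib.parse.urlencode(params)}"
```

Passing the two volumes from the example above to format_page_ids reproduces the unencoded pageID list shown there; build_pages_url then produces the same list in percent-encoded form.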

1.2.5 Response from page request

A correct request returns a 200_OK status, and the body of the response is a binary ZIP stream with the MIME type of “application/zip”.

If the request was sent with concat=false, the returned ZIP file would have the following structure:

pages.zip
   <cleaned_volumeID_1>/
       0000000n.txt
       000000xx.txt
       …
   <cleaned_volumeID_2>/
       0000000p.txt
       0000wwww.txt
       …
   …
   <cleaned_volumeID_n>/
       0000qqqq.txt
       000zzzzz.txt
       …
   ERROR.err

Each volume has its own directory, with the requested pages as individual text files under the directory.  The name of the directory is the cleaned volumeID, and the name of each page text file is the eight-digit fixed-length zero-padded page sequence with a .txt extension (see Section 1.2.3 for details on the cleaned volumeIDs).

If the Data API encounters a problem during the retrieval, it terminates the ZIP stream to prevent corruption, but not before it injects an ERROR.err file into the stream which contains more information on the error.  If ERROR.err is present, the user can assume the ZIP stream does not contain all the resources requested.

If the request was sent with concat=true, the returned ZIP stream has the following structure:

pages.zip
   wordbag.txt
   ERROR.err

In this case, all requested pages are concatenated into a single text file regardless of which volume a page is from.  The result is essentially a “bag of words”, hence the filename wordbag.txt.

If there is an error, the Data API terminates the ZIP stream and injects ERROR.err with information on the problem encountered.  Should ERROR.err be present, the received ZIP stream is missing some requested resources.

1.2.6 Errors

The Data API internally retrieves volume and page data asynchronously from the back-end data store for maximum performance.  If an error occurs before any internal retrieval requests are sent to the back-end data store (e.g. a malformed volumeID or pageID in the client request), the Data API returns an error HTTP response status to the client with a short description of the error as the response body, and no volume or page data is returned.  However, if an error occurs after the internal retrieval requests are sent (e.g. a non-existent volumeID or pageID), the Data API returns a 200_OK HTTP response status and a ZIP stream containing volume and page data as the response body.  The ZIP stream also contains an ERROR.err entry with details on the error.  If multiple errors occur, only the first error is reported.  The Data API holds the error until all internal asynchronous requests have finished (either successfully or with other errors) and then injects ERROR.err as the last entry, to provide maximum tolerance and flexibility to its clients.

The table below lists the possible error status and response body a client may receive, as well as the meaning of the error.

| Response Status | Response Body | Reason |
|---|---|---|
| 400 Bad Request | <p>Missing required parameter volumeIDs</p> | Request for volumes does not have the volumeIDs query parameter. |
| 400 Bad Request | <p>Missing required parameter pageIDs</p> | Request for pages does not have the pageIDs query parameter. |
| 400 Bad Request | <p>Malformed volume ID list. Offending token: xxx</p> | The volumeID list contains tokens that are not valid volumeIDs and cannot be parsed. |
| 400 Bad Request | <p>Malformed page ID list. Offending token: xxx</p> | The pageID list contains tokens that are not valid pageIDs and cannot be parsed. |
| 400 Bad Request | <p>Request too greedy. Request violates Max Volumes Allowed xxx. Offending ID: zzz</p> | Request would touch and retrieve more volumes than allowed by the policy.  In the response body, xxx is the limit set by the policy, and zzz is the first volumeID that exceeds the limit.  Applicable if the policy is set on the server. |
| 400 Bad Request | <p>Request too greedy. Request violates Max Total Pages Allowed xxx. Offending ID: zzz</p> | Request would touch and retrieve more pages than allowed by the policy.  In the response body, xxx is the limit set by the policy, and zzz is the first pageID that exceeds the limit.  Applicable if the policy is set on the server. |
| 400 Bad Request | <p>Request too greedy. Request violates Max Pages Per Volume Allowed xxx. Offending ID: zzz</p> | Request would touch and retrieve more pages per volume than allowed by the policy.  In the response body, xxx is the limit set by the policy, and zzz is the first ID that exceeds the limit.  Applicable if the policy is set on the server. |
| 404 Not Found | <p>Key not found. Offending key: xxx</p> | Request asks for a non-existent volumeID, or asks for a non-existent page sequence of a valid volume (e.g. asking for page 100 of a volume with only 90 pages). |
| 500 Internal Server Error | <p>Server too busy.</p> | The Data API gets a timeout exception while trying to retrieve data. |
| 500 Internal Server Error | <p>Internal server error.</p> | Other exceptions occurred when the Data API tried to retrieve data. |
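
The two failure modes described in this section (an error status with no data, versus a 200_OK response whose ZIP stream carries ERROR.err) suggest a two-step check on the client side.  The sketch below is one possible way to structure it; the exception class and function name are hypothetical, and the caller is assumed to have already read the HTTP status and the raw response body.

```python
import io
import zipfile


class DataApiError(Exception):
    """Raised when the Data API rejects a request before retrieval starts."""


def check_response(status, body):
    """Return (zip_bytes, late_error) or raise DataApiError.

    late_error is the text of ERROR.err, or None when the entry is absent.
    """
    if status != 200:
        # Early failure: the body is a short description, not a ZIP stream.
        raise DataApiError(f"{status}: {body.decode('utf-8', errors='replace')}")
    with zipfile.ZipFile(io.BytesIO(body)) as archive:
        if "ERROR.err" in archive.namelist():
            # Late failure: data is present but may be incomplete.
            detail = archive.read("ERROR.err").decode("utf-8", errors="replace")
            return body, detail
    return body, None
```

A caller would typically treat a raised DataApiError as a request to fix and resend, and a non-None late_error as a signal to compare the returned entries against what was requested.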