The Data API uses a simple version of OAuth2 authentication: a user who is already registered with our OAuth2 server provides a username and password to that server to retrieve a token. The token is then added to the HTTP “Authorization” header in subsequent calls to the Data API. Because the username and password (and later the token) are sent as clear text in the request, the OAuth2 server and the Data API require HTTPS, not HTTP, to prevent eavesdropping.
To send a request to the OAuth2 server, a client sends an HTTP “POST” request to the OAuth2 server endpoint:
Make sure the grant_type is set to “client_credentials”. In addition, the request also requires the following HTTP request header:
as well as a request body that literally says
If the authentication is correct, the client receives a 200 OK status code, and the body of the response is a JSON string:
{“expires_in” : “3600”, “access_token” : <authentication_token>}
The expires_in property gives the length of time, in seconds, for which the token is valid, and the access_token property is the token string that must be added to the HTTP request header for subsequent calls to the Data API (with one caveat; see the next section for details).
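As an illustration of handling the token response described above, the following Python sketch parses the JSON body and computes when the token expires. The function name `parse_token_response` and the `now` parameter are hypothetical helpers introduced for this example; the only details taken from the text are the `expires_in` and `access_token` properties and the fact that `expires_in` is expressed in seconds.

```python
import json
import time


def parse_token_response(body, now=None):
    """Return (access_token, expiry_timestamp) from a token response body.

    Hypothetical helper: `body` is the JSON string returned by the OAuth2
    server, `now` is the current time in seconds (defaults to time.time()).
    """
    if now is None:
        now = time.time()
    payload = json.loads(body)
    # The example response shows expires_in as a string ("3600"),
    # so convert it explicitly before doing arithmetic.
    lifetime_seconds = int(payload["expires_in"])
    return payload["access_token"], now + lifetime_seconds
```

A client could compare the returned expiry timestamp with the current time before each call and request a new token once it has passed.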
The Data API currently supports two resource types, volumes and pages, with an optional parameter to control how the texts are aggregated. To get resources, the client sends an HTTP “GET” request to the Data API service.
The client adds an HTTP “Authorization” header to the HTTP request as follows:
Authorization: Bearer <authentication_token>
Be sure to prepend the string “Bearer ” (with a space at the end) to the authentication token before setting the Authorization header. The OAuth2 protocol requires it.
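A minimal Python sketch of attaching the header might look like the following. It only builds the request object and does not send it. The function name `build_data_api_request` is hypothetical, and the URL used in the usage note is a placeholder rather than the real Data API address.

```python
import urllib.request


def build_data_api_request(url, token):
    """Build (but do not send) a GET request carrying the bearer token.

    Hypothetical helper; `url` must be supplied by the caller.
    """
    # The token travels in clear text inside the header, so refuse plain HTTP.
    if not url.lower().startswith("https://"):
        raise ValueError("the Data API must be called over HTTPS")
    # Note the single space between "Bearer" and the token.
    headers = {"Authorization": "Bearer " + token}
    return urllib.request.Request(url, headers=headers, method="GET")
```

For example, `build_data_api_request("https://example.org/data-api", token)` returns a request that can be passed to `urllib.request.urlopen` once the placeholder URL is replaced with the actual service address.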
Use the following command to request volumes:
<volumeID_list> := <volumeID> [“|” <volumeID> [“|”...]]
<volumeID> := <prefix>“.”<ID_string>
The volumeID list consists of one or more HathiTrust proprietary volumeIDs separated by the pipe character “|”. Each volumeID consists of a prefix, which HathiTrust uses to identify the originating institution of the volume, followed by a dot “.”, followed by a string ID that is defined by each contributing institution. The string IDs from different institutions vary greatly, ranging from fixed-length zero-padded numeric IDs to ARK DOIs containing filesystem-unsafe characters such as colons ‘:’ and slashes ‘/’.
For the purpose of making a request to the Data API, the client can simply treat each volumeID as a string, but may need to URL-encode the volumeID so it can be safely passed as a query parameter. For processing the responses from the Data API, the client may find it necessary to separate the prefix and perform some string manipulation on the volumeIDs (please see more details on this in subsection 1.2.3).
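The string handling described above can be sketched in Python as follows. The function names and the sample volumeIDs are made up for illustration. The sketch encodes only the parameter value, since the name of the query parameter is not assumed here.

```python
from urllib.parse import quote


def build_volume_id_list(volume_ids):
    """Join volumeIDs with the pipe character, as in the grammar above."""
    return "|".join(volume_ids)


def encode_query_value(value):
    """Percent-encode every reserved character, including '|', ':' and '/'."""
    return quote(value, safe="")


def split_volume_id(volume_id):
    """Split a volumeID into (prefix, ID string) at the first dot."""
    prefix, _, id_string = volume_id.partition(".")
    return prefix, id_string
```

Splitting at the first dot only matters because the institution-defined ID string may itself contain dots; the prefix is everything before the first one.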
The optional parameter “concat” controls whether the pages of a volume should be aggregated, as discussed in section 1.2.3. If omitted, the default value is false.
1.2.3 Response from volume request
A correct request to the Data API receives a 200 OK status, and the body of the response is a binary ZIP stream, with the MIME type of “
If the request was sent with concat=false, the returned ZIP file has the following structure:
In the ZIP file, each requested volume is in its own directory, and the name of the directory is a “cleaned” mutation of the original volumeID in a filesystem safe format. Each page of the volume is a text file named by its page sequence as an eight-digit fixed-length zero-padded number.
<cleaned_volumeID> := <prefix>“.”<cleaned_ID_string>
A “cleaned” volumeID consists of the institution-identifying prefix, followed by the dot character ‘.’, followed by the “cleaned” version of the original ID string. This “cleaning” procedure is defined in the pairtree specification (https://wiki.ucop.edu/display/Curation/PairTree) so the directory name is filesystem safe. However, it is worth pointing out that the prefix does not undergo the cleaning procedure. This is because the prefix is not a part of the pairtree structure.
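To map a requested volumeID to its directory name in the ZIP file, a client can apply the same cleaning itself. The sketch below follows the identifier-cleaning rules as described in the pairtree specification (hex-encoding of certain characters as `^hh`, then the single-character swaps `/` to `=`, `:` to `+`, and `.` to `,`). The function names are hypothetical, and it is an assumption that the Data API applies these rules without further variation.

```python
# Characters that pairtree hex-encodes in addition to non-visible ASCII.
_HEX_ENCODED = set('"*+,<=>?\\^|')
# Single-character substitutions applied after hex-encoding.
_SWAPPED = {"/": "=", ":": "+", ".": ","}


def pairtree_clean(id_string):
    """Return the pairtree-cleaned form of an ID string."""
    cleaned = []
    for byte in id_string.encode("utf-8"):
        char = chr(byte)
        if byte < 0x21 or byte > 0x7E or char in _HEX_ENCODED:
            cleaned.append("^%02x" % byte)
        else:
            cleaned.append(_SWAPPED.get(char, char))
    return "".join(cleaned)


def clean_volume_id(volume_id):
    """Clean only the ID string; the prefix is left untouched."""
    prefix, _, id_string = volume_id.partition(".")
    return prefix + "." + pairtree_clean(id_string)
```

With a made-up ARK-style volumeID such as `uc2.ark:/13960/t0000000`, the cleaned directory name would be `uc2.ark+=13960=t0000000`.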
If an error occurs after the Data API has started sending the stream, the Data API injects a special ERROR.err file into the ZIP stream and then properly terminates the stream to prevent corruption. ERROR.err contains information on the error. If this file is present, the ZIP file may not contain all requested volumes/pages.
On the other hand, if the request was sent with concat=true, the returned ZIP would have the following structure:
In this case, the pages belonging to each volume are no longer individual text files. Instead, they are concatenated into a single text file, and the name of the text file is the cleaned volume ID with a .txt extension.
If the entry ERROR.err is present, there was an issue retrieving some resources, and the ZIP stream was terminated prematurely to prevent corruption. Some requested resources may be missing.
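A client-side check for ERROR.err could be sketched in Python as below. The function name `read_data_api_zip` is hypothetical, and decoding entries as UTF-8 is an assumption. The sketch works the same way for both the concat=false and concat=true layouts, because it only looks at entry names.

```python
import io
import zipfile

ERROR_ENTRY = "ERROR.err"


def read_data_api_zip(zip_bytes):
    """Return (texts, error) for a ZIP response body.

    `texts` maps entry names to their decoded text; `error` is the content
    of ERROR.err if that entry is present, otherwise None.
    """
    texts = {}
    error = None
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            # Assumption: entries are UTF-8 text.
            content = archive.read(info).decode("utf-8", errors="replace")
            if info.filename.rsplit("/", 1)[-1] == ERROR_ENTRY:
                error = content
            else:
                texts[info.filename] = content
    return texts, error
```

If `error` is not None, the caller should treat `texts` as possibly incomplete and compare its keys against the volumes or pages that were requested.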
1.2.4 Requesting pages
Use the following command to request specific pages from a book instead of retrieving the entire volume:
<pageID_list> := <pageID> [“|” <pageID> [“|”...]]
<pageID> := <volumeID>“<”<pageID_1> [“,”<pageID_2> [“,”...]]“>”
A pageID list consists of one or more pageIDs separated by the pipe character “|”. Each pageID is a volumeID followed by a comma-separated list of page sequence numbers enclosed in angle brackets. The client may find it necessary to perform URL encoding on the pageID list so all characters can be safely passed as query parameters.
The optional parameter “concat” controls how these pages should be returned. If omitted, the default value for concat is false.
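The pageID list grammar above can be assembled with a few lines of Python. The function names and sample IDs are hypothetical, and page sequence numbers are written here as plain integers, which is an assumption about the expected format. As with volumeIDs, only the parameter value is encoded.

```python
from urllib.parse import quote


def build_page_id(volume_id, page_sequences):
    """Return '<volumeID><seq1,seq2,...>' for one volume."""
    pages = ",".join(str(seq) for seq in page_sequences)
    if not pages:
        raise ValueError("at least one page sequence number is required")
    return f"{volume_id}<{pages}>"


def build_page_id_list(requests):
    """Join pageIDs with '|'; `requests` yields (volume_id, page_sequences)."""
    return "|".join(build_page_id(vol, seqs) for vol, seqs in requests)


def encode_query_value(value):
    """Percent-encode '<', '>', ',', '|' and other reserved characters."""
    return quote(value, safe="")
```

Encoding with `safe=""` matters here because angle brackets, commas, and pipes are all outside the set of characters that can appear unescaped in a query string.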