
Version | Status | Maturity | Comments
1.0-SNAPSHOT | Under testing | Release candidate |

UNDER CONSTRUCTION FOR NEW VERSION OF DATA API

Synopsis

The HTRC Data API is a RESTful web service for the retrieval of multiple volumes, pages of volumes, and METS metadata documents.

API

Retrieve Volumes

Description: Returns requested volumes
URL: /volumes
Supported Response Types: application/zip (normal response), text/html (error response)
Method: POST
Request Types: application/x-www-form-urlencoded
Request Headers: Content-Type: application/x-www-form-urlencoded
Request Body: Request parameters as body content. See Parameters below.
Parameters

All parameter values must be URL encoded.

Name | Description | Type | Default value | Required | Note
volumeIDs | The list of volumeIDs to be retrieved | string | N/A | yes | VolumeIDs are separated by the pipe character '|'
concat | Flag that enables the concatenation option | boolean | false | no | See the section on response format for its impact on the returned data
mets | Flag that indicates whether METS documents should be returned | boolean | false | no |
version | The specific version of the Data API to use | string | N/A | no | Not implemented; placeholder only
Responses
HTTP Status Code | Response Body | Response Type | Description
200 (ok) | A binary Zip stream | application/zip | Page content and metadata of the requested volumes aggregated as a Zip stream
400 (bad request) | <p>Missing required parameter volumeIDs</p> | text/html | The required parameter volumeIDs is missing from the request
400 (bad request) | <p>Malformed Volume ID list. Offending token: ${token}</p> | text/html | The value of volumeIDs is malformed and the Data API cannot parse it. ${token} is the token that caused the error.

Example
Description: Request for volumes inu.3011012 and uc2.ark:/13960/t2qxv15, with the concatenation option enabled so that each volume is a single text file in the returned Zip stream.

Raw volumeIDs: inu.3011012|uc2.ark:/13960/t2qxv15

URL encoded request body: volumeIDs=inu.3011012%7Cuc2.ark%3A%2F13960%2Ft2qxv15&concat=true
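
As a rough illustration, the encoded request body in this example can be produced with Python's standard library. The helper name build_volumes_body is hypothetical and is not part of the Data API; it only joins the volumeIDs with the pipe character and URL encodes the result.

```python
from urllib.parse import urlencode


def build_volumes_body(volume_ids, concat=False, mets=False):
    """Build the form-encoded body for POST /volumes (hypothetical helper)."""
    params = {"volumeIDs": "|".join(volume_ids)}
    # Both flags default to false on the server, so send them only when set.
    if concat:
        params["concat"] = "true"
    if mets:
        params["mets"] = "true"
    # urlencode percent-encodes '|', ':' and '/' in the parameter values.
    return urlencode(params)


body = build_volumes_body(["inu.3011012", "uc2.ark:/13960/t2qxv15"], concat=True)
print(body)
```

The printed string is what goes into the request body, together with the Content-Type header listed above.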

Retrieve Pages

Description: Returns requested pages
URL: /pages
Supported Response Types: application/zip (normal response), text/html (error response)
Method: POST
Request Types: application/x-www-form-urlencoded
Request Headers: Content-Type: application/x-www-form-urlencoded
Request Body: Request parameters as body content. See Parameters below.
Parameters

All parameter values must be URL encoded.

Name | Description | Type | Default value | Required | Note
pageIDs | The list of pageIDs to be retrieved | string | N/A | yes | PageIDs are separated by the pipe character '|'
concat | Flag that enables the concatenation option | boolean | false | no | See the section on response format for its impact on the returned data. "concat" and "mets" cannot both be set.
mets | Flag that indicates whether METS documents should be returned | boolean | false | no | "concat" and "mets" cannot both be set.
version | The specific version of the Data API to use | string | N/A | no | Not implemented; placeholder only
Responses
HTTP Status Code | Response Body | Response Type | Description
200 (ok) | A binary Zip stream | application/zip | Page content and metadata of the requested pages aggregated as a Zip stream
400 (bad request) | <p>Missing required parameter pageIDs</p> | text/html | The required parameter pageIDs is missing from the request
400 (bad request) | <p>Malformed Page ID list. Offending token: ${token}</p> | text/html | The value of pageIDs is malformed and the Data API cannot parse it. ${token} is the token that caused the error.
400 (bad request) | <p>Conflicting parameters in page retrieval. Offending Parameters: ${param1}, ${param2}</p> | text/html | Some request parameters conflict with each other. ${param1} and ${param2} are the names of the conflicting parameters. In the current version of the Data API, this is most likely caused by setting both "mets" and "concat" for page retrieval.
Example
Description: Request for the 1st, 2nd, 20th, and 30th pages of the volume inu.3011012, and the 11th, 17th, 22nd, 30th, 45th, and 55th pages of the volume uc2.ark:/13960/t2qxv15, with each page as a separate text file, along with the corresponding METS document of each volume, in the returned Zip stream.

Raw pageIDs: inu.3011012[1,2,20,30]|uc2.ark:/13960/t2qxv15[11,45,30,17,22,55]

URL encoded request body: pageIDs=inu.3011012%5B1%2C2%2C20%2C30%5D%7Cuc2.ark%3A%2F13960%2Ft2qxv15%5B11%2C45%2C30%2C17%2C22%2C55%5D&mets=true
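
The same body can be assembled programmatically. The sketch below is only an illustration: build_pages_body is a hypothetical helper, and the client-side check for the "concat"/"mets" conflict simply mirrors the 400 response documented above.

```python
from urllib.parse import urlencode


def build_pages_body(pages_by_volume, concat=False, mets=False):
    """Build the form-encoded body for POST /pages (hypothetical helper).

    pages_by_volume maps each volumeID to a list of page sequence numbers.
    """
    if concat and mets:
        # The Data API answers this combination with a 400 response.
        raise ValueError('"concat" and "mets" cannot both be set')
    page_ids = "|".join(
        f"{volume}[{','.join(str(p) for p in pages)}]"
        for volume, pages in pages_by_volume.items()
    )
    params = {"pageIDs": page_ids}
    if concat:
        params["concat"] = "true"
    if mets:
        params["mets"] = "true"
    return urlencode(params)


body = build_pages_body(
    {
        "inu.3011012": [1, 2, 20, 30],
        "uc2.ark:/13960/t2qxv15": [11, 45, 30, 17, 22, 55],
    },
    mets=True,
)
print(body)
```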

Zip Structure Layout

The directory structure layout of the Zip stream returned by the Data API follows one of the patterns below, depending on the optional parameters.

Strictly speaking, the structure inside a Zip file is flat: there are no "directories", only file entries.  In practice, however, almost all Zip tools give the illusion of directories by interpreting the slash characters '/' in the name of each Zip entry.  The discussion here follows that practice and treats the inside of a Zip file as if it were a conventional filesystem.

Suppose there are 2 hypothetical volumes in the corpus: foo.001122, which has 5 pages, and bar.ark:/13960/t123, which has 3 pages.  Both volumes also have associated METS XML files.  The client requests these 2 volumes, and also requests another volume, gon.000000, which no longer exists in the corpus.

Request Description: Retrieve volumes, concat=false

Zip Structure Layout:

foo.001122/
    00000001.txt
    00000002.txt
    00000003.txt
    00000004.txt
    00000005.txt
    mets.xml        (only if mets=true)
bar.ark+=13960=t123/
    00000001.txt
    00000002.txt
    00000003.txt
    mets.xml        (only if mets=true)
ERROR.err

Explanation of Entries:

foo.001122/ is a directory named after the first volume, foo.001122.  The directory name underwent the Pairtree cleaning process, but since it does not contain any filesystem-unsafe characters, the cleaned ID looks the same as the original.

Inside foo.001122/, files 00000001.txt through 00000005.txt are the 5 pages of this volume.

mets.xml will also be inside foo.001122/ if the request parameter mets=true.

bar.ark+=13960=t123/ is a directory named after the second volume, bar.ark:/13960/t123.  The directory name underwent the Pairtree cleaning process, so filesystem-unsafe characters such as colons ':' and slashes '/' are replaced with filesystem-safe characters.

Inside bar.ark+=13960=t123/, files 00000001.txt through 00000003.txt are the 3 pages of this volume.

mets.xml will also be inside bar.ark+=13960=t123/ if the request parameter mets=true.

ERROR.err is a file at the top level.  It is present if the request encountered errors, and it contains the detailed error information.

1.1 OAuth2 Authentication

The Data API uses a simple form of OAuth2 authentication: a user who is already registered with our OAuth2 server sends a username and password to the OAuth2 server to retrieve a token.  The token is then added to the HTTP “Authorization” request header in subsequent calls to the Data API.  Because the username and password (and later the token) are sent as clear text in the request, both the OAuth2 server and the Data API require HTTPS, not HTTP, to prevent eavesdropping.

To send a request to the OAuth2 server, a client sends a “POST” HTTP request to the OAuth2 server endpoint:

https://<oauth2-server>:<port>/oauth2/token?grant_type=client_credentials&client_id=<username>&client_secret=<password>

Make sure the grant_type is set to “client_credentials”.  In addition, the request also requires the following HTTP request header:

content-type: application/x-www-form-urlencoded

as well as a request body that literally says

null

If the authentication succeeds, the client receives a 200 OK status code, and the body of the response is a JSON string:

{"expires_in" : "3600", "access_token" : "<authentication_token>"}

The expires_in property gives the length of time, in seconds, for which the token is valid.  The access_token property is the token string that must be added to the HTTP request header for subsequent calls to the Data API, with one caveat that is described in the next section.
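
A minimal sketch of the token request, using only the Python standard library. The host name, port, and credentials below are placeholders, and the function names are hypothetical; the URL layout, header, and "null" body follow the description above.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_token_request(host, port, username, password):
    """Prepare (but do not send) the POST request for an OAuth2 token."""
    query = urlencode({
        "grant_type": "client_credentials",
        "client_id": username,
        "client_secret": password,
    })
    return Request(
        f"https://{host}:{port}/oauth2/token?{query}",
        data=b"null",  # the request body literally says "null"
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def parse_token_response(body):
    """Extract (token, lifetime in seconds) from the JSON response body."""
    payload = json.loads(body)
    return payload["access_token"], int(payload["expires_in"])


# Placeholder server and credentials, for illustration only.
req = build_token_request("oauth2.example.org", 9443, "alice", "s3cret")
token, lifetime = parse_token_response('{"expires_in": "3600", "access_token": "abc"}')
```

The prepared request could then be sent with urllib.request.urlopen(req) and the response body passed to parse_token_response.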

1.2 Request to Data API

The Data API currently supports two resource types, volumes and pages, with an optional parameter that controls how the texts are aggregated.  To get resources, the client sends an HTTP “POST” request to the Data API service.  The request must also include the following HTTP request header:

Content-type: application/x-www-form-urlencoded

1.2.1 Inserting the OAuth2 authentication token

The client adds an HTTP “Authorization” header to the HTTP request as follows:

Authorization: Bearer <authentication_token>

Be sure to prepend the string “Bearer ” (with a trailing space) to the authentication token before setting the Authorization header.  The OAuth2 protocol requires it.
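
Put together, a Data API request carries both headers. The following is a sketch under assumptions: build_data_api_request is a hypothetical helper, and the server name, port, and token are placeholders.

```python
from urllib.request import Request


def build_data_api_request(url, form_body, token):
    """Prepare a POST request to the Data API with the required headers."""
    return Request(
        url,
        # form_body is assumed to be URL encoded already, hence plain ASCII.
        data=form_body.encode("ascii"),
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            # Note the space between "Bearer" and the token.
            "Authorization": "Bearer " + token,
        },
        method="POST",
    )


# Placeholder URL and token, for illustration only.
req = build_data_api_request(
    "https://dataapi.example.org:25443/data-api/volumes",
    "volumeIDs=inu.3011012",
    "my-token",
)
```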

1.2.2 Requesting volumes

To request volumes, set the request URL to:

https://<server>:<port>/data-api/volumes

and the request body to a URL encoded "volumeIDs" parameter string:

volumeIDs=<volumeID_list>[&concat=true]

<volumeID_list> := <volumeID> [“|” <volumeID> [“|”...]]

<volumeID> := <prefix>“.”<ID_string>


Example:

volumeIDs=inu.3011012%7Cuc2.ark%3A%2F13960%2Ft2qxv15

The volumeID list consists of one or more HathiTrust proprietary volumeIDs separated by the pipe character “|”.  Each volumeID consists of a prefix, which HathiTrust uses to identify the originating institution of the volume, followed by a dot “.”, followed by an ID string that is defined by each contributing institution.  The ID strings from different institutions vary greatly, ranging from fixed-length zero-padded numeric IDs to ARK identifiers containing filesystem-unsafe characters such as colons ":" and slashes "/".

For the purpose of making a request to the Data API, the client can simply view each volumeID as a string, but must perform necessary URL encoding on the volumeID list so it conforms to the application/x-www-form-urlencoded content-type.  For processing the responses from the Data API, the client may find it necessary to separate the prefix and perform some string manipulation on the volumeIDs (please see more details on this in subsection 1.2.3).

The optional parameter “concat” controls whether the pages of a volume should be aggregated, as discussed in section 1.2.3.  If omitted, the default value is false.

1.2.3 Response from volume request

A correct request to the Data API receives a 200 OK status, and the body of the response is a binary ZIP stream with the MIME type “application/zip”.

If the request was sent with concat=false, the returned ZIP file would have the following structure:

volumes.zip
  <cleaned_volumeID_1>/
      00000001.txt
      00000002.txt
      …
      000nnnnn.txt
  <cleaned_volumeID_2>/
      00000001.txt
      00000002.txt
      …
      000xxxxx.txt
  …
  <cleaned_volumeID_n>/
      00000001.txt
      …
      0000zzzz.txt
  ERROR.err

In the ZIP file, each requested volume is in its own directory, and the name of the directory is a “cleaned”, filesystem-safe form of the original volumeID.  Each page of the volume is a text file named after its page sequence number, written as an eight-digit zero-padded number.

<cleaned_volumeID> := <prefix>“.”<cleaned_ID_string>

A “cleaned” volumeID consists of the institution-identifying prefix, followed by the dot character ".", followed by the “cleaned” version of the original ID string.  This “cleaning” procedure is defined in the pairtree specification (https://wiki.ucop.edu/display/Curation/PairTree) so the directory name is filesystem safe.  However, it is worth pointing out that the prefix does not undergo the cleaning procedure.  This is because the prefix is not a part of the pairtree structure.
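
For clients that need to map a requested volumeID to its directory name, the cleaning step might look roughly like the sketch below. It follows one reading of the character mapping in the Pairtree specification (hex-encode a small set of problematic characters, then substitute '/', ':' and '.'), and the function names are hypothetical. It has not been checked against the Data API's own implementation.

```python
# Characters that the Pairtree spec hex-encodes as ^hh, in addition to
# bytes outside the visible ASCII range 0x21-0x7e.
_HEX_ENCODED = set('"*+,<=>?\\^|')
# Single-character substitutions applied after hex-encoding.
_SUBSTITUTIONS = str.maketrans({"/": "=", ":": "+", ".": ","})


def pairtree_clean(id_string):
    """Return a filesystem-safe form of an ID string."""
    encoded = []
    for byte in id_string.encode("utf-8"):
        ch = chr(byte)
        if byte < 0x21 or byte > 0x7E or ch in _HEX_ENCODED:
            encoded.append(f"^{byte:02x}")
        else:
            encoded.append(ch)
    return "".join(encoded).translate(_SUBSTITUTIONS)


def clean_volume_id(volume_id):
    """Clean only the ID string; the prefix before the first dot is kept."""
    prefix, id_string = volume_id.split(".", 1)
    return f"{prefix}.{pairtree_clean(id_string)}"


print(clean_volume_id("bar.ark:/13960/t123"))
print(clean_volume_id("foo.001122"))
```

Splitting at the first dot assumes that prefixes never contain a dot themselves, which matches the volumeID grammar given earlier.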

If an error occurs after the Data API has started sending the stream, the Data API injects a special ERROR.err file into the ZIP stream and then properly terminates the stream to prevent corruption.  ERROR.err contains information on the error.  If this file is present, the ZIP file may not contain all requested volumes/pages.   

On the other hand, if the request was sent with concat=true, the returned ZIP would have the following structure:

volumes.zip
   <cleaned_volumeID_1>.txt
   <cleaned_volumeID_2>.txt
   …
   <cleaned_volumeID_n>.txt
   ERROR.err

In this case, the pages belonging to each volume are no longer individual text files.  Instead, they are concatenated into a single text file whose name is the cleaned volumeID with a .txt extension.

If the entry ERROR.err is present, there was an issue retrieving some resources, and the ZIP stream was terminated prematurely to prevent corruption.  Some requested resources may be missing.
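
On the client side, both layouts can be read with Python's zipfile module. The sketch below is an illustration only: read_volume_zip is a hypothetical helper, the text is assumed to be UTF-8, and the sample ZIP (including the wording inside ERROR.err) is made up to stand in for a real response body.

```python
import io
import zipfile


def read_volume_zip(zip_bytes):
    """Return ({cleaned_volumeID: {entry_name: text}}, error_text_or_None)."""
    volumes, error = {}, None
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            if name.endswith("/"):
                continue  # explicit directory entry without content
            text = zf.read(name).decode("utf-8")  # assumes UTF-8 text
            if name == "ERROR.err":
                error = text
            elif "/" in name:
                # concat=false layout: <cleaned_volumeID>/<page>.txt
                volume, page = name.split("/", 1)
                volumes.setdefault(volume, {})[page] = text
            else:
                # concat=true layout: <cleaned_volumeID>.txt
                volume = name[:-4] if name.endswith(".txt") else name
                volumes[volume] = {name: text}
    return volumes, error


# Made-up sample standing in for a response body.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("foo.001122/00000001.txt", "first page")
    zf.writestr("ERROR.err", "sample error details")
volumes, error = read_volume_zip(buf.getvalue())
```

A non-None error value signals that some requested resources may be missing from the returned dictionary.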

1.2.4 Requesting pages

To request pages, set the request URL to:

https://<server>:<port>/data-api/pages

and the request body to a URL encoded "pageIDs" parameter string:

pageIDs=<pageID_list>[&concat=true]

<pageID_list> := <pageID> [“|” <pageID> [“|”...]]

<pageID> := <volumeID>“[”<pageID_1> [“,”<pageID_2> [“,”...]]“]”

 

Example:

pageIDs=inu.3011012%5B1%2C2%2C20%2C30%5D%7Cuc2.ark%3A%2F13960%2Ft2qxv15%5B11%2C45%2C30%2C17%2C22%2C55%5D

A pageID list consists of one or more pageIDs separated by the pipe character “|”.  Each pageID is a volumeID followed by a comma-separated list of page sequence numbers enclosed in square brackets.  The client must perform URL encoding on the pageID list so it conforms to the application/x-www-form-urlencoded content-type.

The optional parameter “concat” controls how these pages should be returned.  If omitted, the default value for concat is false.
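
A client may want to validate a raw pageID list against this grammar before sending it. The parser below is a sketch that mirrors the grammar on the client side; it is not the server's parser, parse_page_ids is a hypothetical name, and it assumes that volumeIDs never contain the pipe character.

```python
import re

# <volumeID> followed by a bracketed, comma-separated list of page numbers.
_PAGE_ID = re.compile(r"^(?P<volume>.+)\[(?P<pages>\d+(?:,\d+)*)\]$")


def parse_page_ids(raw):
    """Split a raw (not URL encoded) pageID list into (volumeID, pages) pairs."""
    result = []
    for token in raw.split("|"):
        match = _PAGE_ID.match(token)
        if match is None:
            raise ValueError(f"Malformed page ID list. Offending token: {token}")
        pages = [int(p) for p in match.group("pages").split(",")]
        result.append((match.group("volume"), pages))
    return result


parsed = parse_page_ids(
    "inu.3011012[1,2,20,30]|uc2.ark:/13960/t2qxv15[11,45,30,17,22,55]"
)
```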

1.2.5 Response from page request

A correct request returns a 200 OK status, and the body of the response is a binary ZIP stream with the MIME type “application/zip”.

If the request was sent with concat=false, the returned ZIP file would have the following structure:

pages.zip
   <cleaned_volumeID_1>/
       0000000n.txt
       000000xx.txt
       …
   <cleaned_volumeID_2>/
       0000000p.txt
       0000wwww.txt
       …
   …
   <cleaned_volumeID_n>/
       0000qqqq.txt
       000zzzzz.txt
       …
   ERROR.err

Each volume has its own directory, with the requested pages as individual text files under the directory.  The name of the directory is the cleaned volumeID, and the name of each page text file is the eight-digit zero-padded page sequence number with a .txt extension (see Section 1.2.3 for details on the cleaned volumeIDs).
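
The naming rule makes it possible to predict which entries a complete response should contain. A small sketch, with page_entry_name as a hypothetical helper name:

```python
def page_entry_name(cleaned_volume_id, page_sequence):
    """Return the expected ZIP entry name for one page (concat=false)."""
    # Eight digits, zero-padded, followed by the .txt extension.
    return f"{cleaned_volume_id}/{page_sequence:08d}.txt"


names = [page_entry_name("bar.ark+=13960=t123", seq) for seq in (11, 45)]
print(names)
```

Comparing such expected names with the actual entries in the ZIP shows which pages are missing when the stream was cut short.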

If the Data API encounters a problem during the retrieval, it terminates the ZIP stream to prevent corruption, but not before it injects an ERROR.err file into the stream which contains more information on the error.  If ERROR.err is present, the user can assume the ZIP stream does not contain all the resources requested.

If the request was sent with concat=true, the returned ZIP stream has the following structure:

pages.zip
   wordbag.txt
   ERROR.err

In this case, all requested pages are concatenated into a single text file regardless of which volume each page comes from.  The result is essentially a “bag of words”, hence the filename wordbag.txt.

If there is an error, the Data API terminates the ZIP stream and injects ERROR.err with information on the problem encountered.  Should ERROR.err be present, the received ZIP stream is missing some requested resources.

1.2.6 Errors

The Data API internally retrieves volume and page data asynchronously from the back-end data store for maximum performance.  How an error is reported depends on when it occurs.

If an error occurs before any internal retrieval requests are sent to the back-end data store (e.g. a malformed volumeID or pageID in the client request), the Data API returns an error HTTP response status with a short description of the error as the response body, and no volume or page data is returned.

If an error occurs after the internal retrieval requests are sent (e.g. a non-existent volumeID or pageID), the Data API returns a 200 OK HTTP response status and a ZIP stream containing volume and page data as the response body.  The ZIP stream also contains an ERROR.err entry with details on the error.  If multiple errors occur, only the first error is reported.  The Data API holds the error until all internal asynchronous requests have finished (either successfully or with other errors) and then injects ERROR.err as the last entry, to give its clients maximum tolerance and flexibility.
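
Because a 200 status alone does not guarantee a complete result, a client has to check both the HTTP status and the ZIP contents. The sketch below shows one way to do that; classify_response and its return labels are hypothetical, and the sample ZIPs are made up for illustration.

```python
import io
import zipfile


def classify_response(status, body):
    """Return ("failed", msg), ("partial", msg) or ("complete", None)."""
    if status != 200:
        # Error detected before retrieval started: the body is a short
        # HTML message and no volume or page data was sent.
        return "failed", body.decode("utf-8", errors="replace")
    with zipfile.ZipFile(io.BytesIO(body)) as zf:
        if "ERROR.err" in zf.namelist():
            # Error detected during retrieval: some data may be missing.
            return "partial", zf.read("ERROR.err").decode("utf-8", errors="replace")
    return "complete", None


def _sample_zip(entries):
    """Build an in-memory ZIP from {name: text}, for demonstration only."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, text in entries.items():
            zf.writestr(name, text)
    return buf.getvalue()


ok = classify_response(200, _sample_zip({"foo.001122.txt": "text"}))
partial = classify_response(
    200, _sample_zip({"foo.001122.txt": "text", "ERROR.err": "sample details"})
)
failed = classify_response(400, b"<p>Missing required parameter volumeIDs</p>")
```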

The table below lists the possible error status and response body a client may receive, as well as the meaning of the error.

Response Status | Response Body | Reason
400 Bad Request | <p>Missing required parameter volumeIDs</p> | Request for volumes does not have the volumeIDs query parameter.
400 Bad Request | <p>Missing required parameter pageIDs</p> | Request for pages does not have the pageIDs query parameter.
400 Bad Request | <p>Malformed volume ID list. Offending token: xxx</p> | The volumeID list contains tokens that are not valid volumeIDs and cannot be parsed.
400 Bad Request | <p>Malformed page ID list. Offending token: xxx</p> | The pageID list contains tokens that are not valid pageIDs and cannot be parsed.
400 Bad Request | <p>Request too greedy. Request violates Max Volumes Allowed xxx. Offending ID: zzz</p> | Request would touch and retrieve more volumes than allowed by the policy.  In the response body, xxx is the limit set by the policy, and zzz is the first volumeID that exceeds the limit.  Applicable if the policy is set on the server.
400 Bad Request | <p>Request too greedy. Request violates Max Total Pages Allowed xxx. Offending ID: zzz</p> | Request would touch and retrieve more pages than allowed by the policy.  In the response body, xxx is the limit set by the policy, and zzz is the first pageID that exceeds the limit.  Applicable if the policy is set on the server.
400 Bad Request | <p>Request too greedy. Request violates Max Pages Per Volume Allowed xxx. Offending ID: zzz</p> | Request would touch and retrieve more pages per volume than allowed by the policy.  In the response body, xxx is the limit set by the policy, and zzz is the first ID that exceeds the limit.  Applicable if the policy is set on the server.
404 Not Found | <p>Key not found. Offending key: xxx</p> | Request asks for a non-existent volumeID, or for a non-existent page sequence of a valid volume (e.g. page 100 of a volume with only 90 pages).
500 Internal Server Error | <p>Server too busy.</p> | The Data API received a timeout exception while trying to retrieve data.
500 Internal Server Error | <p>Internal server error.</p> | Another exception occurred while the Data API was trying to retrieve data.
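
Since several of these messages name the offending item, a client can extract it for logging or retry logic. The regular expression below is a sketch that assumes the message formats stay exactly as listed in this document; offending_value is a hypothetical helper name.

```python
import re

# Matches "Offending token: ...", "Offending ID: ...", "Offending key: ..."
# and "Offending Parameters: ..." up to the closing </p> tag.
_OFFENDING = re.compile(r"Offending (?:token|ID|key|Parameters): (?P<value>.*?)</p>")


def offending_value(error_body):
    """Return the offending item named in an error body, or None."""
    match = _OFFENDING.search(error_body)
    return match.group("value") if match else None


print(offending_value("<p>Key not found. Offending key: gon.000000</p>"))
print(offending_value("<p>Server too busy.</p>"))
```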


