...

Request Description | Zip Structure Layout | Explanation of Entries

Retrieve volumes

concat=false

 

 

Because the request parameter concat=false, each volume has its own directory, and the pages and metadata documents of each volume are individual files stored under the volume directory.

foo.001122/ is a directory named after the first volume, foo.001122.  The directory name underwent a Pairtree clean process, but since it does not contain any filesystem-unsafe characters, the cleaned ID looks the same as the original.

Inside foo.001122/, files 00000001.txt through 00000005.txt are the 5 pages of this volume.

mets.xml will also be inside foo.001122/ if the request parameter mets=true.

bar.ark+=13960=t123/ is a directory named after the second volume, bar.ark:/13960/t123.  The directory name underwent a Pairtree clean process so that filesystem-unsafe characters such as colons ':' and slashes '/' are escaped and replaced with filesystem-safe characters.

Inside bar.ark+=13960=t123/, files 00000001.txt through 00000003.txt are the 3 pages of this volume.

mets.xml will also be inside bar.ark+=13960=t123/ if the request parameter mets=true.

ERROR.err is a file at the top level.  It is present if the request encountered errors; the detailed error information is stored in this file.
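The mapping from a volume ID to its directory name can be sketched as follows. This is a minimal illustration, not the Data API's own code: it assumes the prefix before the first '.' is left untouched and that only the single-character Pairtree substitutions ('/' to '=', ':' to '+', '.' to ',') are needed. The full Pairtree specification also hex-encodes some other characters, which this sketch ignores. The function name clean_volume_id is hypothetical.

```python
# Single-character substitutions used by Pairtree cleaning.
# Assumption: the hex-encoding step of the full specification is omitted.
_PAIRTREE_MAP = str.maketrans({"/": "=", ":": "+", ".": ","})


def clean_volume_id(volume_id: str) -> str:
    """Return a filesystem-safe form of a volume ID (hypothetical helper)."""
    # Assumption: the prefix before the first '.' is kept as-is and only
    # the ID string after it is cleaned.
    prefix, sep, id_string = volume_id.partition(".")
    return prefix + sep + id_string.translate(_PAIRTREE_MAP)


if __name__ == "__main__":
    for vid in ("foo.001122", "bar.ark:/13960/t123"):
        print(vid, "->", clean_volume_id(vid))
```

With the two volume IDs used in this table, the sketch yields the directory names shown in the layout: foo.001122 is unchanged and bar.ark:/13960/t123 becomes bar.ark+=13960=t123.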

Retrieve volumes

concat=true

Because the request parameter concat=true, each volume is a single text file, where the pages of the volume are concatenated into the file in the page order.

foo.001122.txt is the text file entry for the volume foo.001122.  The filename underwent a Pairtree clean process, but since it does not contain any filesystem-unsafe characters, the cleaned ID looks identical to the original.

foo.001122.mets.xml will be present if the request parameter mets=true.

bar.ark+=13960=t123.txt is the text file entry for the volume bar.ark:/13960/t123.  The filename underwent a Pairtree clean process, so filesystem-unsafe characters such as colons ':' and slashes '/' are replaced with filesystem-safe characters.

bar.ark+=13960=t123.mets.xml will be present if the request parameter mets=true.

ERROR.err is a file at the top level.  It is present if the request encountered errors; the detailed error information is stored in this file.
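A client could sort the entries of a concat=true volume response along these lines. This is a sketch under stated assumptions: the Zip content is already available as bytes, entries are UTF-8 text, and the function name read_concat_volumes and the demo archive are hypothetical. The demo archive is built in memory so the example does not contact any server.

```python
import io
import zipfile


def read_concat_volumes(zip_bytes: bytes) -> dict:
    """Group the entries of a concat=true volume Zip by kind."""
    result = {"texts": {}, "mets": {}, "errors": None}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            # Assumption: all entries are UTF-8 encoded text.
            data = zf.read(name).decode("utf-8")
            if name == "ERROR.err":
                result["errors"] = data
            elif name.endswith(".mets.xml"):
                result["mets"][name[: -len(".mets.xml")]] = data
            elif name.endswith(".txt"):
                result["texts"][name[: -len(".txt")]] = data
    return result


def build_demo_zip() -> bytes:
    """Build a small in-memory Zip that mimics the layout described above."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("foo.001122.txt", "page one\npage two\n")
        zf.writestr("bar.ark+=13960=t123.txt", "only page\n")
        zf.writestr("bar.ark+=13960=t123.mets.xml", "<mets/>")
    return buf.getvalue()


if __name__ == "__main__":
    grouped = read_concat_volumes(build_demo_zip())
    print(sorted(grouped["texts"]))
    print(sorted(grouped["mets"]))
    print(grouped["errors"])
```

The keys of the returned dictionaries are the cleaned volume IDs, not the original ones, because that is what the entry names carry.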

 

 

 

Retrieve pages

concat=false

The Zip stream returned from the Data API for page retrieval with the request parameter concat=false is very similar to that returned for volume retrieval with concat=false.

 

 

 

1.1 OAuth2 Authentication

...

1.2 Request to Data API

The Data API currently supports two resource types, volumes and pages, with an optional parameter to control how the texts are aggregated.  To get resources, the client sends an HTTP "POST" request to the Data API service.  The request must also have the following HTTP request header:

Content-type: application/x-www-form-urlencoded
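As a rough illustration, such a request could be assembled with Python's standard library as shown below. The endpoint URL is a placeholder and the helper name build_post_request is hypothetical. The OAuth2 token header is deliberately left out because its exact form is not shown in this section. The request object is only constructed here, not sent.

```python
import urllib.request

# Hypothetical endpoint; the real Data API URL depends on the deployment.
DATA_API_URL = "https://data-api.example.org/volumes"


def build_post_request(url: str, body: str) -> urllib.request.Request:
    """Build (but do not send) a form-encoded POST request."""
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-type": "application/x-www-form-urlencoded"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_post_request(DATA_API_URL, "volumeIDs=foo.001122")
    print(req.get_method(), req.full_url)
    print(req.get_header("Content-type"))
```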

1.2.1 Inserting the OAuth2 authentication token

...

1.2.2 Requesting volumes

...

and the request body to a URL-encoded "volumeIDs" parameter string:

volumeIDs=<volumeID_list>[&concat=true]

<volumeID_list> := <volumeID> [“|” <volumeID> [“|”...]]

<volumeID> := <prefix>“.”<ID_string>
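The grammar above can be turned into a request body roughly as follows. This is a sketch, not the official client: the helper name build_volume_body is hypothetical, and it assumes that omitting the concat parameter is equivalent to concat=false.

```python
from urllib.parse import parse_qs, urlencode


def build_volume_body(volume_ids, concat=False):
    """Return a URL-encoded body for a volume request (hypothetical helper)."""
    if not volume_ids:
        raise ValueError("at least one volume ID is required")
    # Volume IDs are joined with '|' as in <volumeID_list>.
    params = {"volumeIDs": "|".join(volume_ids)}
    if concat:
        # Assumption: leaving the parameter out means concat=false.
        params["concat"] = "true"
    # urlencode percent-encodes '|', ':' and '/' in the values.
    return urlencode(params)


if __name__ == "__main__":
    body = build_volume_body(["foo.001122", "bar.ark:/13960/t123"], concat=True)
    print(body)
    print(parse_qs(body))
```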

...

1.2.3 Response from volume request

...

1.2.4 Requesting pages

...

and the request body to a URL-encoded "pageIDs" parameter string:

pageIDs=<pageID_list>[&concat=true]

<pageID_list> := <pageID> [“|” <pageID> [“|”...]]

<pageID> := <volumeID>“[”<pageID_1> [“,”<pageID_2> [“,”...]]“]”
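A page request body following this grammar might be built as shown below. The helper names are hypothetical, and the sketch assumes that page identifiers are plain page sequence numbers written without zero padding, which this section does not confirm.

```python
from urllib.parse import parse_qs, urlencode


def build_page_id(volume_id, pages):
    """Format one <pageID>: the volume ID followed by '[p1,p2,...]'."""
    # Assumption: pages are sequence numbers rendered without zero padding.
    return f"{volume_id}[{','.join(str(p) for p in pages)}]"


def build_page_body(pages_by_volume, concat=False):
    """Return a URL-encoded body for a page request (hypothetical helper)."""
    page_ids = [build_page_id(vid, pages) for vid, pages in pages_by_volume.items()]
    params = {"pageIDs": "|".join(page_ids)}
    if concat:
        params["concat"] = "true"
    return urlencode(params)


if __name__ == "__main__":
    body = build_page_body({"foo.001122": [1, 2], "bar.ark:/13960/t123": [3]})
    print(body)
    print(parse_qs(body)["pageIDs"][0])
```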

 

...

1.2.5 Response from page request

...

1.2.6 Errors

...

Response Status | Response Body | Reason
400 Bad Request | <p>Missing required parameter volumeIDs</p> | The request for volumes does not have the volumeIDs query parameter.
400 Bad Request | <p>Missing required parameter pageIDs</p> | The request for pages does not have the pageIDs query parameter.
400 Bad Request | <p>Malformed volume ID list. Offending token: xxx</p> | The volumeID list contains tokens that are not valid volumeIDs and cannot be parsed.
400 Bad Request | <p>Malformed page ID list. Offending token: xxx</p> | The pageID list contains tokens that are not valid pageIDs and cannot be parsed.
400 Bad Request | <p>Request too greedy. Request violates Max Volumes Allowed xxx. Offending ID: zzz</p> | The request would retrieve more volumes than allowed by the policy.  In the response body, xxx is the limit set by the policy, and zzz is the first volumeID that exceeds the limit.  Applicable if the policy is set on the server.
400 Bad Request | <p>Request too greedy. Request violates Max Total Pages Allowed xxx. Offending ID: zzz</p> | The request would retrieve more pages than allowed by the policy.  In the response body, xxx is the limit set by the policy, and zzz is the first pageID that exceeds the limit.  Applicable if the policy is set on the server.
400 Bad Request | <p>Request too greedy. Request violates Max Pages Per Volume Allowed xxx. Offending ID: zzz</p> | The request would retrieve more pages per volume than allowed by the policy.  In the response body, xxx is the limit set by the policy, and zzz is the first ID that exceeds the limit.  Applicable if the policy is set on the server.
404 Not Found | <p>Key not found. Offending key: xxx</p> | The request asks for a non-existent volumeID, or for a non-existent page sequence of a valid volume (e.g. page 100 of a volume with only 90 pages).
500 Internal Server Error | <p>Server too busy.</p> | The Data API got a timeout exception while trying to retrieve data.
500 Internal Server Error | <p>Internal server error.</p> | Another exception occurred while the Data API was trying to retrieve data.

For page retrieval with concat=false, the difference from volume retrieval is that only the requested pages are included.
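The error responses listed in section 1.2.6 share a simple shape: a status code and a short message wrapped in a <p> element, sometimes naming an offending token, ID, or key. A client might unpack them roughly as follows. The function name summarize_error and the returned dictionary layout are hypothetical, and the sketch assumes the bodies look exactly like the examples in the table.

```python
import re

# Matches the trailing "Offending token: xxx", "Offending ID: zzz"
# or "Offending key: xxx" part of a message.
_OFFENDER = re.compile(r"Offending (?:token|ID|key): (.+?)\s*$")
_PARAGRAPH = re.compile(r"\s*<p>(.*)</p>\s*", re.DOTALL)


def summarize_error(status: int, body: str) -> dict:
    """Extract the message and the offending value from an error response."""
    wrapped = _PARAGRAPH.fullmatch(body)
    # Fall back to the raw body if it is not wrapped in <p>...</p>.
    message = wrapped.group(1).strip() if wrapped else body.strip()
    offender = _OFFENDER.search(message)
    return {
        "status": status,
        "message": message,
        "offender": offender.group(1) if offender else None,
    }


if __name__ == "__main__":
    print(summarize_error(404, "<p>Key not found. Offending key: foo.001122</p>"))
    print(summarize_error(500, "<p>Server too busy.</p>"))
```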

Retrieve pages

concat=true

Because the request parameter concat=true, the returned Zip stream is a "sequence of words" where the content of all pages from all volumes is aggregated into a single text file entry named wordseq.txt.

ERROR.err is a file at the top level.  It is present if the request encountered some errors and the detailed error information is stored in this file.

Note that there is no METS metadata returned because mixing METS metadata and page content into the word sequence could potentially contaminate the information in the word sequence file.
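Reading the word sequence from such a response could look like the sketch below. It assumes the Zip content is already in memory as bytes and that wordseq.txt is UTF-8 text with whitespace-separated words. The function name read_word_sequence is hypothetical.

```python
import io
import zipfile


def read_word_sequence(zip_bytes: bytes):
    """Return (words, error_text) from a concat=true page response."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        names = set(zf.namelist())
        words = []
        if "wordseq.txt" in names:
            # Assumption: UTF-8 text, words separated by whitespace.
            words = zf.read("wordseq.txt").decode("utf-8").split()
        errors = None
        if "ERROR.err" in names:
            errors = zf.read("ERROR.err").decode("utf-8")
    return words, errors
```

Checking for ERROR.err alongside the word sequence matters because a response can contain both a partial word sequence and error details.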

Additional Information

While the Data API by itself does not enforce any security mechanism for authentication or authorization, it is typically deployed behind an OAuth2 Servlet Filter.  A client making a request to the Data API through the OAuth2 Servlet Filter must first obtain a valid OAuth2 token from the token service, and present the token as an additional HTTP request header to the filtered Data API.  Please refer to "Using WSO2 Identity Server as the OAuth2 Provider for HTRC" for details on the usage.