1. HTRC Data API

The HTRC Data API is a RESTful web service allowing clients to retrieve digital text by providing volume IDs or page IDs.  The Data API is set up to require OAuth2 user authentication for security reasons.

1.1 OAuth2 Authentication

The Data API uses a simple version of OAuth2 authentication: a user who is already registered with our OAuth2 server sends a username and password to that server to retrieve a token.  The token is then added to the HTTP “Authorization” request header in subsequent calls to the Data API.  Because the username and password (and later the token) are sent as clear text in the request, the OAuth2 server and the Data API require HTTPS, not HTTP, to prevent eavesdropping.
 
To send a request to the OAuth2 server, a client sends a “POST” HTTP request to the OAuth2 server endpoint:

https://<oauth2-server>:<port>/oauth2/token?grant_type=client_credentials&client_id=<username>&client_secret=<password>

Make sure the grant_type is set to “client_credentials”.  In addition, the request also requires the following HTTP request header:

content-type: application/x-www-form-urlencoded

as well as a request body that literally says

null

If the authentication is correct, the client receives a 200_OK status code, and the body of the response is a JSON string:

{"expires_in" : "3600", "access_token" : "<authentication_token>"}

The expires_in property shows the length of time for which the token is valid in seconds, and the access_token property is the token string that must be added to the HTTP request header for subsequent calls to the Data API (there is one caveat, however.  See the next section for details).
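
The token exchange described above can be sketched in Python with only the standard library. This is a minimal illustration, not the official client: the host `oauth2.example.org`, the port, the credentials, and all function names are hypothetical, and the request is built but not sent.

```python
import json
import urllib.parse
import urllib.request


def build_token_request(base_url, username, password):
    """Build (but do not send) the POST request for an OAuth2 token."""
    query = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": username,
        "client_secret": password,
    })
    return urllib.request.Request(
        url=f"{base_url}/oauth2/token?{query}",
        data=b"null",  # the request body literally says "null"
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def parse_token_response(body):
    """Return (access_token, lifetime_in_seconds) from the JSON response body."""
    payload = json.loads(body)
    return payload["access_token"], int(payload["expires_in"])


def authorization_header(access_token):
    """Header to attach to subsequent Data API calls."""
    return {"Authorization": "Bearer " + access_token}


# Hypothetical endpoint, credentials, and response body, for illustration only.
request = build_token_request("https://oauth2.example.org:9443", "alice", "s3cret")
token, lifetime = parse_token_response('{"expires_in": "3600", "access_token": "abc123"}')
```

To actually send the request, one would pass `request` to `urllib.request.urlopen` and feed the response body to `parse_token_response`.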

1.2 Request to Data API

The Data API currently supports two resource types: volumes, and pages, with an optional parameter to control how the texts are aggregated.  To get resources, the client sends an HTTP “GET” request to the Data API service.

1.2.1 Inserting the OAuth2 authentication token

The client adds an HTTP “Authorization” header to the HTTP request as follows:

Authorization: Bearer <authentication_token>

Be sure to prepend the string "Bearer " (with a space at the end) to the authentication token before setting it to the Authorization header.  The OAuth2 protocol requires it.

1.2.2 Requesting volumes

Use the following command to request volumes:

https://<server>:<port>/data-api/volumes?volumeIDs=<volumeID_list>[&concat=true]

where

<volumeID_list> := <volumeID> ["|" <volumeID> ["|" ...]]
<volumeID> := <prefix> "." <ID_string>
 
Example:

https://silvermaple.pti.indiana.edu:25443/data-api/volumes?volumeIDs=inu.3011012|uc2.ark:/13960/t2qxv15
 
The volumeID list consists of one or more HathiTrust proprietary volumeIDs separated by the pipe character “|”.  Each volumeID consists of a prefix, which HathiTrust uses to identify the originating institution of the volume, followed by a dot “.”, followed by an ID string that is defined by each contributing institution.  The ID strings from different institutions vary greatly, ranging from fixed-length zero-padded numeric IDs to ARK identifiers containing filesystem-unsafe characters such as colons ":" and slashes "/".
 
For the purpose of making a request to the Data API, the client can simply view each volumeID as a string, but may need to perform necessary URL encoding on the volumeID so it can be safely passed as a query parameter.  For processing the responses from the Data API, the client may find it necessary to separate the prefix and perform some string manipulation on the volumeIDs (please see more details on this in subsection 1.2.3).
 
The optional parameter “concat” controls whether the pages of a volume should be aggregated, as discussed in section 1.2.3.  If omitted, the default value is false.
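
As a rough sketch of the URL encoding mentioned above, the following Python snippet builds the GET URL for a volume request. The function name is hypothetical; the host is the one used in the example above, and `urllib.parse.urlencode` percent-encodes the pipe, colon, and slash characters.

```python
import urllib.parse


def build_volumes_url(base_url, volume_ids, concat=False):
    """Return the GET URL for a volume request with URL-encoded volumeIDs."""
    params = {"volumeIDs": "|".join(volume_ids)}
    if concat:
        params["concat"] = "true"
    return base_url + "/data-api/volumes?" + urllib.parse.urlencode(params)


url = build_volumes_url(
    "https://silvermaple.pti.indiana.edu:25443",
    ["inu.3011012", "uc2.ark:/13960/t2qxv15"],
)
print(url)
```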

1.2.3 Response from volume request

A correct request to the Data API receives a 200_OK status, and the body of the response is a binary ZIP stream, with the MIME type of “application/zip”.
 

 If the request was sent with concat=false, the returned ZIP file would have the following structure:

volumes.zip

  <cleaned_volumeID_1>/

    00000001.txt

    00000002.txt

    …

    000nnnnn.txt

  <cleaned_volumeID_2>/

    00000001.txt

    00000002.txt

    …

    000xxxxx.txt

  …

  <cleaned_volumeID_n>/

    00000001.txt

    …

    0000zzzz.txt

  ERROR.err
 
 
In the ZIP file, each requested volume is in its own directory, and the name of the directory is a “cleaned” mutation of the original volumeID in a filesystem safe format.  Each page of the volume is a text file named by its page sequence as an eight-digit fixed-length zero-padded number.
 
<cleaned_volumeID> := <prefix>“.”<cleaned_ID_string>
 
A “cleaned” volumeID consists of the institution-identifying prefix, followed by the dot character ‘.’, followed by the “cleaned” version of the original ID string.  This “cleaning” procedure is defined in the pairtree specification (https://wiki.ucop.edu/display/Curation/PairTree) so the directory name is filesystem safe.  However, it is worth pointing out that the prefix does not undergo the cleaning procedure.  This is because the prefix is not a part of the pairtree structure.
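
A client that needs to map a requested volumeID to its directory name can reproduce the cleaning step locally. The sketch below follows the character rules as given in the Pairtree specification (hex-encode non-visible ASCII and a fixed set of special characters as `^hh`, then map `/` to `=`, `:` to `+`, and `.` to `,`); the function names are hypothetical, and the Data API's own implementation may differ in detail.

```python
# Characters that Pairtree hex-encodes in addition to non-visible ASCII.
_HEX_ENCODE = set('"*+,<=>?\\^|')
# Single-character substitutions applied to the remaining characters.
_SINGLE_CHAR = {"/": "=", ":": "+", ".": ","}


def pairtree_clean(id_string):
    """Apply Pairtree identifier cleaning to an ID string (prefix excluded)."""
    cleaned = []
    for byte in id_string.encode("utf-8"):
        char = chr(byte)
        if byte < 0x21 or byte > 0x7E or char in _HEX_ENCODE:
            cleaned.append("^%02x" % byte)
        else:
            cleaned.append(_SINGLE_CHAR.get(char, char))
    return "".join(cleaned)


def clean_volume_id(volume_id):
    """Clean only the part after the first dot; the prefix is left untouched."""
    prefix, _, id_string = volume_id.partition(".")
    return prefix + "." + pairtree_clean(id_string)


print(clean_volume_id("uc2.ark:/13960/t2qxv15"))
```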
 
If an error occurs after the Data API has started sending the stream, the Data API injects a special ERROR.err file into the ZIP stream and then properly terminates the stream to prevent corruption.  ERROR.err contains information on the error.  If this file is present, the ZIP file may not contain all requested volumes/pages.   
 
On the other hand, if the request was sent with concat=true, the returned ZIP would have the following structure:
 
volumes.zip
   <cleaned_volumeID_1>.txt
   <cleaned_volumeID_2>.txt
   …
   <cleaned_volumeID_n>.txt
   ERROR.err
 
In this case, the pages belonging to each volume are no longer individual text files.  Instead, they are concatenated into a single text file, and the name of the text file is the cleaned volume ID with the .txt extension.
 
If the entry ERROR.err is present, there was an issue retrieving some resources, and the ZIP stream was terminated prematurely to prevent corruption.  Some requested resources may be missing.
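
A client might check for ERROR.err before trusting the contents of a concat=true response, as in this minimal Python sketch. The function name is hypothetical, the text is assumed to be UTF-8 (the section does not state the encoding), and the ZIP used here is built in memory purely for illustration.

```python
import io
import zipfile


def read_concatenated_volumes(zip_bytes):
    """Return ({cleaned_volume_id: text}, error_text_or_None) for a concat=true ZIP."""
    volumes = {}
    error = None
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            data = archive.read(name).decode("utf-8")  # assumption: UTF-8 text
            if name == "ERROR.err":
                error = data
            elif name.endswith(".txt"):
                volumes[name[: -len(".txt")]] = data
    return volumes, error


# Build a small stand-in for a Data API response.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("inu.3011012.txt", "page one\npage two\n")
    archive.writestr("ERROR.err", "volume not found")

volumes, error = read_concatenated_volumes(buffer.getvalue())
if error is not None:
    print("Incomplete result:", error)
```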

1.2.4 Requesting pages

Use the following command to request specific pages from a book instead of retrieving the entire volume:
 
https://<server>:<port>/data-api/pages?pageIDs=<pageID_list>[&concat=true]
 
<pageID_list> := <pageID> ["|" <pageID> ["|" ...]]
<pageID> := <volumeID> "<" <pageID_1> ["," <pageID_2> ["," ...]] ">"
 
Example:
https://silvermaple.pti.indiana.edu:25443/data-api/pages?pageIDs=inu.3011012<1,2,20,30>|uc2.ark:/13960/t2qxv15<11,45,30,17,22,55>
 
A pageID list consists of one or more pageIDs separated by the pipe character “|”.  Each pageID is a volumeID followed by a comma-separated list of page sequence numbers enclosed in angle brackets.  The client may find it necessary to perform URL encoding on the pageID list so all characters can be safely passed as query parameters.
 
The optional parameter “concat” controls how these pages should be returned.  If omitted, the default value for concat is false.

1.2.5 Response from page request

A correct request returns a 200_OK status, and the body of the response is a binary ZIP stream with the MIME type of “application/zip”.
 
If the request was sent with concat=false, the returned ZIP file would have the following structure:
 
pages.zip
   <cleaned_volumeID_1>/
       0000000n.txt
       000000xx.txt
       …
   <cleaned_volumeID_2>/
       0000000p.txt
       0000wwww.txt
       …
   …
   <cleaned_volumeID_n>/
       0000qqqq.txt
       000zzzzz.txt
       …
   ERROR.err
 
Each volume has its own directory, with the requested pages as individual text files under the directory.  The name of the directory is the cleaned volumeID, and the name of each page text file is the eight-digit fixed-length zero-padded page sequence with the .txt extension (see Section 1.2.3 for details on the cleaned volumeIDs).
 
If Data API encounters a problem during the retrieval, it terminates the ZIP stream to prevent corruption, but not before it injects an ERROR.err file into the stream which contains more information on the error.  If ERROR.err is present, the user can assume the ZIP stream does not contain all the resources requested.
 
If the request was sent with concat=true, the returned ZIP stream has the following structure:
 
pages.zip
   wordbag.txt
   ERROR.err
 
In this case, all requested pages are concatenated into a single text file regardless of which volume a page is from.  The result is essentially a “bag of words”, hence the filename wordbag.txt.
 
If there is an error, the Data API terminates the ZIP stream and injects ERROR.err with information on the problem encountered.  Should ERROR.err be present, the received ZIP stream is missing some requested resources.

1.2.6 Errors

The Data API detects error conditions as early as possible, and returns the corresponding error status and message to the client if an error is detected before it has committed to send the ZIP stream.
 
The table below lists the possible error status and response body a client may receive, as well as the meaning of the error.
 

 

Response Status | Response Body | Reason

...

| Version | Status | Maturity | Comments |
|---|---|---|---|
| 3.0.0 | Released | Stable | CQL-based access instead of Hector-based access |
| 1.0 | Released | Released | |
| 1.0.1-SNAPSHOT | Under testing | Release candidate | Added token count feature |


Synopsis

The HTRC Data API is a RESTful web service for the retrieval of multiple volumes, pages of volumes, and METS metadata documents.  In order to support the efficient retrieval of volumes and pages in bulk, the Data API deviates from the typical RESTful API design out of necessity: Resources are not identified on the URL paths, but instead are sent as request parameters.

Use:

 

The HTRC Data API pulls full-text OCR and METS metadata for specified volumes into the HTRC Data Capsules. You can download the full text of specified volumes using the HTRC Data API, with the volumeIDs of the desired volumes passed as parameters to the API. Volume IDs are standard identification numbers for items in the HathiTrust Digital Library. Currently, the HTRC Data API can only access a snapshot of public domain volumes.
 

API

Note: all parameter values must be URL encoded

Retrieve Volumes

Description: Returns requested volumes

URL: /volumes

Supported Response Types: application/zip (normal response), text/plain (error response)

Method: POST

Request Types: application/x-www-form-urlencoded

Request Headers: `Authorization: Bearer [JWT_TOKEN]` (replacing `[JWT_TOKEN]` with the valid token)

Request Body: Request parameters as body content.  See Parameters below.

Parameters


| Name | Description | Type | Default value | Required | Note |
|---|---|---|---|---|---|
| volumeIDs | The list of volumeIDs to be retrieved. | string | N/A | yes | VolumeIDs are separated by the pipe character '\|' |
| concat | The flag to indicate concatenation option. | boolean | false | no | See section on response format for details on its impact on the returned data |
| mets | The flag to indicate if METS document should be returned | boolean | false | no | |
| version | A specific version of the Data API to use | string | N/A | no | Not implemented.  Place holder only |


Responses


| HTTP Status Code | Response Body | Response Type | Description |
|---|---|---|---|
| 200 (ok) | A binary Zip stream | application/zip | Page content and metadata of the requested volumes aggregated as a Zip stream |
| 400 (bad request) | Missing required parameter volumeIDs | text/plain | The required parameter volumeIDs is missing in the request |
| 400 (bad request) | Malformed Volume ID List. Offending token: ${token} | text/plain | The value for volumeIDs is malformed and the Data API cannot parse it.  ${token} will be the token that causes the error. |


Example

Description: Request for volumes inu.3011012 and uc2.ark:/13960/t2qxv15, with concatenation option enabled so each volume is a single text file in the returned Zip stream.

Raw volumeIDs: inu.3011012|uc2.ark:/13960/t2qxv15

URL encoded request body: volumeIDs=inu.3011012%7Cuc2.ark%3A%2F13960%2Ft2qxv15&concat=true


Example Request
curl -v -X POST -o volumes.zip \
        -d "volumeIDs=uc2.ark%3A%2F13960%2Ft12n5fs57" \
        -H "Content-Type: application/x-www-form-urlencoded" \
        -H "Authorization: Bearer TOKEN" \
        https://sandbox.htrc.illinois.edu:25443/data-api/volumes

Note: The TOKEN placeholder in the example request needs to have a valid value.
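
For clients written in Python, a roughly equivalent request can be assembled with the standard library. This is a sketch under assumptions: the function name is hypothetical, `TOKEN` is still a placeholder, and the request is built but not sent. The encoded body matches the URL encoded request body shown in the example description above.

```python
import urllib.parse
import urllib.request


def build_volumes_request(base_url, token, volume_ids, concat=False, mets=False):
    """Build (but do not send) a POST request to the /volumes endpoint."""
    params = {"volumeIDs": "|".join(volume_ids)}
    if concat:
        params["concat"] = "true"
    if mets:
        params["mets"] = "true"
    # urlencode percent-encodes '|', ':' and '/' in the parameter values.
    body = urllib.parse.urlencode(params).encode("ascii")
    return urllib.request.Request(
        url=base_url + "/volumes",
        data=body,
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Authorization": "Bearer " + token,
        },
        method="POST",
    )


request = build_volumes_request(
    "https://sandbox.htrc.illinois.edu:25443/data-api",
    "TOKEN",  # placeholder; a valid JWT is required
    ["inu.3011012", "uc2.ark:/13960/t2qxv15"],
    concat=True,
)
print(request.data.decode("ascii"))
```

Passing `request` to `urllib.request.urlopen` and writing the response bytes to a file would correspond to the `-o volumes.zip` option of the curl call.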


Retrieve Pages

Description: Returns requested pages

URL: /pages

Supported Response Types: application/zip (normal response), text/plain (error response)

Method: POST

Request Types: application/x-www-form-urlencoded

Request Headers: Content-Type: application/x-www-form-urlencoded

Request Body: Request parameters as body content.  See Parameters below.

Parameters
Parameters


| Name | Description | Type | Default value | Required | Note |
|---|---|---|---|---|---|
| pageIDs | The list of pageIDs to be retrieved | string | N/A | yes | PageIDs are separated by the pipe character '\|' |
| concat | The flag to indicate concatenation option | boolean | false | no | See section on response format for details on its impact on the returned data. Note: "concat" and "mets" cannot be both set |
| mets | The flag to indicate if METS documents should be returned | boolean | false | no | Note: "concat" and "mets" cannot be both set |
| version | A specific version of the Data API to use | string | N/A | no | Not implemented.  Place holder only |


Responses


| HTTP Status Code | Response Body | Response Type | Description |
|---|---|---|---|
| 200 (ok) | A binary Zip stream | application/zip | Page content and metadata of the requested pages aggregated as a Zip stream |
| 400 (bad request) | Missing required parameter pageIDs | text/plain | The required parameter pageIDs is missing in the request |
| 400 (bad request) | Malformed Page ID List. Offending token: ${token} | text/plain | The value for pageIDs is malformed and the Data API cannot parse it.  ${token} will be the token that caused the error. |
| 400 (bad request) | Conflicting parameters in page retrieval. Offending Parameters: ${param1}, ${param2} | text/plain | Some request parameters have conflict. ${param1} and ${param2} will be the names of the parameters that caused the conflict. In the current version of the Data API, this is most likely caused by setting both "mets" and "concat" for page retrieval. |


Example

Description: Request for the 1st, 2nd, 20th, and 30th pages of the volume inu.3011012, and the 11th, 17th, 22nd, 30th, 45th, and 55th pages of the volume uc2.ark:/13960/t2qxv15, with each page being a separate text file along with the corresponding METS document of each volume in the returned Zip stream.

Raw pageIDs: inu.3011012[1,2,20,30]|uc2.ark:/13960/t2qxv15[11,45,30,17,22,55]

URL encoded request body: pageIDs=inu.3011012%5B1%2C2%2C20%2C30%5D%7Cuc2.ark%3A%2F13960%2Ft2qxv15%5B11%2C45%2C30%2C17%2C22%2C55%5D&mets=true
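
The raw and URL-encoded forms in this example can be produced programmatically. The Python sketch below uses hypothetical function names; it joins page sequence numbers into the bracketed pageIDs syntax shown above and rejects the "concat" plus "mets" combination on the client side before any request is made.

```python
import urllib.parse


def build_page_ids(pages_by_volume):
    """Join {volumeID: [page sequence numbers]} into the raw pageIDs string."""
    return "|".join(
        "{}[{}]".format(volume_id, ",".join(str(page) for page in pages))
        for volume_id, pages in pages_by_volume.items()
    )


def encode_pages_body(pages_by_volume, concat=False, mets=False):
    """Return the URL-encoded request body for the /pages endpoint."""
    if concat and mets:
        raise ValueError('"concat" and "mets" cannot both be set')
    params = {"pageIDs": build_page_ids(pages_by_volume)}
    if concat:
        params["concat"] = "true"
    if mets:
        params["mets"] = "true"
    return urllib.parse.urlencode(params)


requested = {
    "inu.3011012": [1, 2, 20, 30],
    "uc2.ark:/13960/t2qxv15": [11, 45, 30, 17, 22, 55],
}
body = encode_pages_body(requested, mets=True)
print(body)
```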



Token Count (Deprecated)

Description: Returns token counts of requested volumes

URL: /tokencount

Supported Response Types: application/zip (normal response), text/plain (error response)

Method: POST

Request Types: application/x-www-form-urlencoded

Request Headers: Content-Type: application/x-www-form-urlencoded

Request Body: Request parameters as body content.  See Parameters below.
Parameters


| Name | Description | Type | Default Value | Required | Note |
|---|---|---|---|---|---|
| volumeIDs | The list of volumes to be token counted | string | N/A | yes | VolumeIDs are separated by the pipe character '\|' |
| level | Specifies whether the token counts are aggregated at volume level or page level.  Use "volume" for volume level, and "page" for page level | string | volume | no | |
| sortBy | Specifies the field on which the token count output is sorted.  Use "token" for sorting based on the token's UTF-8 order, and "count" for sorting based on the token count order.  If left unspecified, the results are not guaranteed to be in any order. | string | N/A | no | Token ordering is based on UTF-8 character values, so character "Z" comes before character "a" (if using ascending ordering).  For token count ordering, tokens with the same count are ordered by the token's UTF-8 values. |
| sortOrder | Specifies whether the output uses ascending or descending ordering.  Use "asc" for ascending ordering, and "desc" for descending ordering. | string | asc | no | This parameter only has effect when used together with sortBy; otherwise it is ignored. |
| version | A specific version of the Data API to use | string | N/A | no | Not implemented.  Place holder only |


Responses


| HTTP Status Code | Response Body | Response Type | Description |
|---|---|---|---|
| 200 (ok) | A binary Zip stream | application/zip | Token count output aggregated as a Zip stream |
| 400 (bad request) | Missing required parameter volumeIDs | text/plain | The required parameter volumeIDs is missing in the request |
| 400 (bad request) | Malformed Volume ID List. Offending token: ${token} | text/plain | The value for volumeIDs is malformed and the Data API cannot parse it. ${token} will be the token that caused the error. |


Example

Description: Request for page level token count of the volumes inu.3011012 and uc2.ark:/13960/t2qxv15, with the token count output to be sorted by the tokens in descending order

Raw volumeIDs: inu.3011012|uc2.ark:/13960/t2qxv15

URL encoded request body: volumeIDs=inu.3011012%7Cuc2.ark%3A%2F13960%2Ft2qxv15&level=page&sortBy=token&sortOrder=desc



 

Response Format

Zip Structure Layout

The directory structure layout of the Zip stream returned from the Data API may be one of the following patterns depending on the optional parameters:

Note

Strictly speaking, inside a Zip file the structure is flat, so there is no "directories" but only file entries.  However, in practice almost all Zip tools give the illusion of directories by leveraging the slash characters '/' in the name of each Zip entry.  For the discussion here, we follow such practice and treat the inside of a Zip file as if it were a conventional filesystem.

Suppose there are 2 hypothetical volumes in the corpus: foo.001122, which has 5 pages, and bar.ark:/13960/t123, which has 3 pages.  Both volumes also have associated METS XML files.  The client requests these 2 volumes, and also requests another volume, gon.000000, that no longer exists in the corpus, which causes the entry ERROR.err to be included in the returned Zip stream.

Request Description | Zip Structure Layout | Explanation of Entries

Retrieve volumes, concat=false

volumes.zip
    |-- foo.001122/
    |    |-- 00000001.txt
    |    |-- 00000002.txt
    |    |-- 00000003.txt
    |    |-- 00000004.txt
    |    |-- 00000005.txt
    |    \-- mets.xml
    |-- bar.ark+=13960=t123/
    |    |-- 00000001.txt
    |    |-- 00000002.txt
    |    |-- 00000003.txt
    |    \-- mets.xml
    |-- volume-rights.txt
    \-- ERROR.err

 

 

Because the request parameter concat=false, each volume has its own directory, and the pages and metadata documents of each volume are individual files stored under the volume directory.

foo.001122/ is a directory named after the first volume, foo.001122.  The directory name underwent a Pairtree clean process, but since it does not contain any filesystem unsafe characters, the cleaned ID looks the same as the original.

Inside foo.001122/, files 00000001.txt through 00000005.txt are the 5 pages of this volume

mets.xml will also be inside of foo.001122/ if the request parameter mets=true

bar.ark+=13960=t123/ is a directory named after the second volume, bar.ark:/13960/t123.  The directory name underwent a Pairtree clean process so that filesystem-unsafe characters such as colons ':' and slashes '/' are escaped and replaced with filesystem-safe characters.

Inside bar.ark+=13960=t123/, files 00000001.txt through 00000003.txt are the 3 pages of this volume.

mets.xml will also be inside of bar.ark+=13960=t123/ if the request parameter mets=true

volume-rights.txt is a file at the top level. It contains the HTRC Data Protection Level for each volume.

ERROR.err is a file at the top level.  It is present if the request encountered some errors and the detailed error information is stored in this file.  In this example, its presence is caused by the request for a non-existent volume gon.000000
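
The layout described above can be walked with Python's `zipfile` module. The following is a minimal sketch with hypothetical function names; it groups page entries by their cleaned-volumeID directory and reports whether ERROR.err is present, using an in-memory ZIP that mimics the hypothetical volumes of this example.

```python
import io
import zipfile
from collections import defaultdict


def group_pages_by_volume(archive):
    """Map each cleaned volumeID directory to its sorted page entry names."""
    pages = defaultdict(list)
    for name in archive.namelist():
        directory, slash, filename = name.partition("/")
        # Top-level entries have no slash; mets.xml is not a page.
        if slash and filename.endswith(".txt"):
            pages[directory].append(filename)
    return {volume: sorted(files) for volume, files in pages.items()}


def has_error(archive):
    """True if the Data API injected ERROR.err into the stream."""
    return "ERROR.err" in archive.namelist()


# Stand-in for a concat=false response, built in memory for illustration.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as out:
    out.writestr("foo.001122/00000002.txt", "second page")
    out.writestr("foo.001122/00000001.txt", "first page")
    out.writestr("foo.001122/mets.xml", "<mets/>")
    out.writestr("bar.ark+=13960=t123/00000001.txt", "only page")
    out.writestr("volume-rights.txt", "rights data")
    out.writestr("ERROR.err", "gon.000000 not found")

with zipfile.ZipFile(io.BytesIO(buffer.getvalue())) as archive:
    grouped = group_pages_by_volume(archive)
    incomplete = has_error(archive)
```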

Retrieve volumes, concat=true

volumes.zip
    |-- foo.001122.txt
    |-- foo.001122.mets.xml
    |-- bar.ark+=13960=t123.txt
    |-- bar.ark+=13960=t123.mets.xml
    |-- volume-rights.txt
    \-- ERROR.err

Because the request parameter concat=true, each volume is a single text file, where the pages of the volume are concatenated into the file in the page order.

foo.001122.txt is the text file entry for the volume foo.001122.  The filename underwent a Pairtree clean process, but since it does not contain any filesystem unsafe characters, the cleaned ID looks identical to the original.

foo.001122.mets.xml will be present if the request parameter mets=true.

bar.ark+=13960=t123.txt is the text file entry for the volume bar.ark:/13960/t123.  The filename underwent a Pairtree clean process, so filesystem-unsafe characters such as colons ':' and slashes '/' are replaced with filesystem-safe characters.

bar.ark+=13960=t123.mets.xml will be present if the request parameter mets=true.

volume-rights.txt is a file at the top level. It contains the HTRC Data Protection Level for each volume.

ERROR.err is a file at the top level.  It is present if the request encountered some errors and the detailed error information is stored in this file.  In this example, its presence is caused by the request for a non-existent volume gon.000000

 

 

 

Retrieve pages, concat=false

pages.zip
    |-- foo.001122/
    |    |-- 00000001.txt
    |    |-- 00000002.txt
    |    |-- 00000003.txt
    |    |-- 00000004.txt
    |    |-- 00000005.txt
    |    \-- mets.xml
    |-- bar.ark+=13960=t123/
    |    |-- 00000001.txt
    |    |-- 00000002.txt
    |    |-- 00000003.txt
    |    \-- mets.xml
    |-- volume-rights.txt
    \-- ERROR.err

The Zip stream returned from the Data API for page retrieval with the request parameter concat=false is very similar to that returned for volume retrieval with concat=false.  The difference is that only the requested pages are included.

Retrieve pages, concat=true

pages.zip
    |-- wordseq.txt
    \-- ERROR.err

Because the request parameter concat=true, the returned Zip stream is a "sequence of words" where the content of all pages from all volumes is aggregated into a single text file entry named wordseq.txt.

ERROR.err is a file at the top level.  It is present if the request encountered some errors and the detailed error information is stored in this file.  In this example, its presence is caused by the request for a non-existent volume gon.000000

Note that there is no METS metadata returned because mixing METS metadata and page content into the word sequence could potentially contaminate the information in the word sequence file.

Token count (Deprecated), level=volume

tokencount.zip
    |-- foo.001122.count
    |-- bar.ark+=13960=t123.count
    \-- ERROR.err

Because the request parameter level=volume, the returned Zip stream contains the token count of each volume as an entry, and the name of the entry is the Pairtree cleaned volumeID with ".count" as the extension.

ERROR.err is a file at the top level.  It is present if the request encountered some errors and the detailed error information is stored in this file.  In this example, its presence is caused by the request for a non-existent volume gon.000000

Token count (Deprecated), level=page

tokencount.zip
    |-- foo.001122/
    |    |-- 00000001.count
    |    |-- 00000002.count
    |    |-- 00000003.count
    |    |-- 00000004.count
    |    \-- 00000005.count
    |-- bar.ark+=13960=t123/
    |    |-- 00000001.count
    |    |-- 00000002.count
    |    \-- 00000003.count
    \-- ERROR.err

 

Because the request parameter level=page, each volume in the returned Zip stream is a directory whose entry name is the Pairtree cleaned volumeID.  Each page of the volume is an entry under that directory, named with the 8-digit zero-padded page sequence number followed by the ".count" extension.

ERROR.err is a file at the top level.  It is present if the request encountered some errors and the detailed error information is stored in this file.  In this example, its presence is caused by the request for a non-existent volume gon.000000

Token Count Output Format and Sorting Order (Deprecated)

Each token count output entry is a list of tokens and their number of occurrences within the aggregation.  A token and its occurrence count are separated by a space character (0x20), and each token-occurrence pair is on its own line, separated from other pairs by a newline character (0x0A).  However, if an aggregation does not contain any text (e.g. an empty page), that particular entry will be empty.
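
Parsing this format takes only a few lines. In the Python sketch below, the function name is hypothetical; each line is split at its last space, and an empty entry yields an empty list.

```python
def parse_token_counts(text):
    """Parse a ".count" entry into a list of (token, count) pairs, in file order."""
    pairs = []
    for line in text.split("\n"):
        if not line:
            continue  # skip the trailing empty line, or an entirely empty entry
        token, _, count = line.rpartition(" ")
        pairs.append((token, int(count)))
    return pairs


pairs = parse_token_counts("orange 1\nbanana 2\nacorn 2\n")
print(pairs)
```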

sortBy & sortOrder | Token Count Output | Description
unspecified

orange 1
banana 2
acorn 2
A-team 1
Xylophone 3
apple 1
coconut 1

If the parameter sortBy is not specified, the returned result does not guarantee any ordering of the token-occurrence pairs, nor does it guarantee the same ordering of these pairs between any 2 runs with the exact same parameters.

sortBy=token&sortOrder=asc

A-team 1
Xylophone 3
acorn 2
apple 1
banana 2
coconut 1
orange 1

With ascending ordering on the tokens, the returned result is sorted using the UTF-8 value of the tokens in ascending order.  In this example, the tokens starting with the capital letter "X" come before those starting with the lower case letter "a".

sortBy=token&sortOrder=desc

orange 1
coconut 1
banana 2
apple 1
acorn 2
Xylophone 3
A-team 1

This is the exact reverse of the case above.

sortBy=count&sortOrder=asc

A-team 1
apple 1
coconut 1
orange 1
acorn 2
banana 2
Xylophone 3

With ascending ordering on the occurrence count, the returned result is sorted using the count value in ascending order; however, when multiple tokens have the same count, the order is determined by the ascending ordering of the tokens.

sortBy=count&sortOrder=desc

Xylophone 3
banana 2
acorn 2
orange 1
coconut 1
apple 1
A-team 1

This is the exact reverse of the above case; specifically, when multiple tokens have the same count, the order is determined by the descending ordering of the tokens.
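
The ordering rules described in this section can be reproduced on the client side, for example to sort unsorted output locally. The Python sketch below is an emulation under the rules as documented here, not the Data API's own implementation; the function name is hypothetical.

```python
def sort_token_counts(pairs, sort_by, sort_order="asc"):
    """Sort (token, count) pairs following the documented sortBy/sortOrder rules."""
    if sort_by == "token":
        # Compare tokens by their UTF-8 byte values, so "X" sorts before "a".
        key = lambda pair: pair[0].encode("utf-8")
    elif sort_by == "count":
        # Ties on the count are broken by the token's UTF-8 byte values.
        key = lambda pair: (pair[1], pair[0].encode("utf-8"))
    else:
        raise ValueError("sort_by must be 'token' or 'count'")
    return sorted(pairs, key=key, reverse=(sort_order == "desc"))


unsorted_pairs = [
    ("orange", 1), ("banana", 2), ("acorn", 2), ("A-team", 1),
    ("Xylophone", 3), ("apple", 1), ("coconut", 1),
]
by_token = sort_token_counts(unsorted_pairs, "token")
by_count_desc = sort_token_counts(unsorted_pairs, "count", "desc")
```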

Access Data API with JWT

While the Data API by itself does not enforce any security mechanism for authentication and/or authorization, it can only be called directly using a JWT in Secure Mode while in a Capsule.  Each Capsule comes with a fixed JWT saved to the image, which you use to make API calls. The scripts with the tokens are saved at /home/dcuser/.htrc in each Capsule.