

The HTRC Solr Proxy is a thin layer over a Solr service that limits access, audits requests, and provides additional functionality. It is running on and we use it in our examples.

The HTRC Solr Proxy has two cores: one for OCR full-text search, and one for metadata search covering all MARC bibliographic fields plus HTRC-contributed fields (field names with the prefix "htrc_"). The OCR core lets users issue a query against the ocr field and get back a list of matching volume IDs. The metadata core supports a wider range of queries across different fields. Before introducing the queries, the table below lists the frequently used fields in the HTRC metadata core and how they are processed. Indexed fields are searchable, whereas the content of stored fields can be returned to users.


The table below shows the solr core metadata fields:

Field name               | Indexed | Stored | Explanation
id                       | Y       | Y      | Field for volume ID
fullrecord               | N       | Y      | Field for storing the full MARC record
title                    | Y       | Y      | Field for book title
author                   | Y       | Y      | Author for a volume
oclc                     | Y       | Y      | The OCLC Control Number
isbn                     | Y       | Y      | The International Standard Book Number (ISBN) for a volume
issn                     | Y       | Y      | The International Standard Serial Number (ISSN) for a volume
callnumber               | Y       | Y      | The call number for a volume
sudoc                    | Y       | Y      | Superintendent of Documents number for a volume. Please refer to
language                 | Y       | N      | Field for the languages in which a volume is written
htsource                 | Y       | Y      | Field for the sources of a volume, e.g. "Indiana University"
era                      | Y       | N      | The era of the volume
geographic               | Y       | N      | Brief geographic information for a volume, e.g. "pennsylvania"
country_of_pub           | Y       | N      | The publication country of the book
topic                    | Y       | N      | Topic of the volume
genre                    | Y       | N      | Genre the volume belongs to
publishDate              | Y       | Y      | Publish date of the volume
publisher                | Y       | Y      | Publisher of the volume
edition                  | Y       | Y      | Edition of the volume
allfields                | N       | Y      | All bibliographic fields above except the OCR field are indexed in this field. This is the default search field.
htrc_pageCount           | Y       | Y      | The page count of the volume
htrc_wordCount           | Y       | Y      | The word count of the volume (HTRC uses the Lucene 3.6 standard tokenizer to split words)
htrc_charCount           | Y       | Y      | The character count of the volume
                         |         |        | Gender of the authors of this volume. This can be either male or female or both.
                         |         |        | Male/female/gender-unknown authors of this volume, multivalued
htrc_volumePageCountBin  | Y       | Y      | Quartile info of this volume based on its volume page count. Values are S/M/L/XL
htrc_volumeWordCountBin  | Y       | Y      | Quartile info of this volume based on its word count. Values are S/M/L/XL

Some fields are indexed but not stored. These fields do not appear in search results but are very useful for faceted search, which HTRC Blacklight relies on heavily.

The OCR core also has many of these metadata fields, but users are discouraged from sending metadata queries to it. The metadata fields in the OCR core are stale and will not be maintained. In fact, HTRC will remove all metadata fields from the OCR core so that it has only two fields, id and ocr; all metadata queries should go to the metadata core.


Here are some usage examples:

To retrieve the list of volumeIDs for those volumes which have authors with "eliot" in their names, point your browser to:

Likewise, to retrieve the list of volumeIDs for those volumes which have titles containing the word "may", point your browser to:


To retrieve the list of volumeIDs for those volumes which are in the Spanish language, point your browser to: 


  1. Case matters. 
  2. If you do not know in advance the exact value to search for, you can look at the values of a field as a facet. (There is no limit to the number of results that can be returned, so use this cautiously.) The query below is useful if, for example, you did not know the controlled vocabulary for Spanish-language books, i.e. that you need to search the language field for "Spanish" with a capital "S", as in the example above. The number of rows is set to 0, so that you get only the facet counts when you execute this query:
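
As a sketch of the kind of query note 2 describes, the snippet below builds a facet-only request against the metadata core, using the select URL pattern and the rows, facet, and facet.field parameters covered in this guide. The host name and port are placeholders, not the real proxy address, and the helper name facet_only_url is made up for illustration.

```python
from urllib.parse import urlencode

# Placeholder address (an assumption): substitute the real proxy host and port.
BASE_URL = "http://localhost:8080/solr/meta/select/"


def facet_only_url(field, query="*:*"):
    """Return a URL asking for facet counts on `field` and zero result rows."""
    params = {
        "q": query,            # *:* matches every volume in the index
        "rows": 0,             # suppress the document list, keep only facet counts
        "facet": "on",
        "facet.field": field,
    }
    # urlencode percent-encodes characters such as "*" and ":" in the query.
    return BASE_URL + "?" + urlencode(params)


url = facet_only_url("language")
print(url)
```

The facet counts in the response then show the exact spelling and capitalization of each language value, which can be copied into a follow-up query.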

Basic Queries

All the basic queries are allowed. "Update" operations are banned. Detailed instructions can be found at and

Because HTRC has very large index files, distributed search is used to make use of more system resources. Previously, "qt=sharding" had to be appended to the REST call to make sure the query was sent to all shards. Users no longer need to do that, because the default "qt" is "sharding". Users can also specify a query type explicitly to override the default "qt" parameter. What follows are instructions for the most frequently used queries; these are sufficient for most uses:

Basic term query pattern

{field name} : {term}

For example, "title: war" returns the volume IDs of all volumes whose title field contains the word "war".

Boolean query

Simple concatenation of two queries by "AND" or "OR":

 {field name1} : {term1} AND {field name2} : {term2}

 {field name1} : {term1}  OR  {field name2} : {term2}


 title:war AND author: Hill

returns volume IDs that have "war" in the title and are written by an author named "Hill".
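
When queries are assembled in a program rather than typed by hand, the boolean patterns above can be composed with a few small helpers. This is only an illustrative sketch; the function names term, all_of, and any_of are hypothetical and not part of any HTRC or Solr API.

```python
def term(field, value):
    """Format one basic term query, e.g. term("title", "war") gives "title:war"."""
    return f"{field}:{value}"


def all_of(*clauses):
    """Join clauses with AND: every clause must match."""
    return " AND ".join(clauses)


def any_of(*clauses):
    """Join clauses with OR: at least one clause must match."""
    return " OR ".join(clauses)


query = all_of(term("title", "war"), term("author", "Hill"))
print(query)  # title:war AND author:Hill
```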

Numeric range query

HTRC solr has only one numeric field:  publishDate

{field name} : {[ value1 TO value2]} //value1 and value2 are included.      


publishDate : [1990 TO 1999]

returns all volume IDs whose publishDate is between 1990 and 1999, inclusive of both 1990 and 1999.
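
A range query string can be generated the same way. The helper below is a hypothetical sketch that formats an inclusive range and rejects bounds given in the wrong order; the check is a convenience of the sketch, not a documented behaviour of the proxy.

```python
def inclusive_range(field, low, high):
    """Format an inclusive range query such as publishDate:[1990 TO 1999]."""
    if low > high:
        raise ValueError(f"low bound {low} is greater than high bound {high}")
    return f"{field}:[{low} TO {high}]"


query = inclusive_range("publishDate", 1990, 1999)
print(query)  # publishDate:[1990 TO 1999]
```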

Prefix query

* is used here for zero or more characters 

{field name}:{ prefix*}  


title: chil*       

returns all volume IDs whose title field contains a word starting with "chil". If you want all the volumes in this index, just use "*:*".

Making a request

To issue a basic query to HTRC Solr Proxy metadata core, the full RESTful request will be:

http://{hostname}:{port number}/solr/meta/select/?q={basic query}


returns volume IDs that have "war" in the title.
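
The request pattern above can also be driven from a script. The sketch below uses only the Python standard library; the host and port are placeholders, and select_url and fetch are hypothetical helper names. Percent-encoding the query with urlencode keeps characters such as ":" and spaces from breaking the URL.

```python
import urllib.request
from urllib.parse import urlencode


def select_url(host, port, query, core="meta", **params):
    """Build a select URL for the given core, percent-encoding all parameters."""
    params = {"q": query, **params}
    return f"http://{host}:{port}/solr/{core}/select/?{urlencode(params)}"


def fetch(url, timeout=30):
    """Send the request and return the raw response body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


# Placeholder host and port (assumptions); no request is sent here.
url = select_url("localhost", 8080, "title:war", rows=5)
print(url)  # http://localhost:8080/solr/meta/select/?q=title%3Awar&rows=5
```

Calling fetch(url) against the real proxy address would return the response body; extra Solr parameters are passed as keyword arguments, as with rows here.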

Other query parameters

You can also set the number of results you want by setting "rows".


returns the top 5 hits among the volumes that have "war" in the title.
Ranking is done by Solr itself. Apache Solr's scoring combines the Boolean Model (BM) of information retrieval with the Vector Space Model (VSM): documents "approved" by the BM are scored by the VSM.
The solr query parameters are very flexible. For more details, please refer to

Faceted Queries

The facet query syntax is:

http://{hostname}:{port number}/solr/meta/select/?q={basic query}&facet=on&facet.field={field}


returns all volume IDs in the index;  at the bottom of the result set, all genres will be listed with the number of volumes belonging to each genre in the search result. 

For more details of facet parameters, please refer to .

Full-text Queries

Users can run full-text searches on the ocr field through the OCR core. Basic, boolean, phrase, and wildcard queries are all supported by the OCR core. When doing an OCR full-text search, remember to use a slightly different REST call pattern:

http://{hostname}:{port number}/solr/ocr/select/?q={query string}

Otherwise, a 400 response code is returned, because there is no ocr field in the meta core.

<warn>RESPONSE CODE: 400</warn>

A quick example of ocr full text search:

returns the volumes that have "hathitrust" in their textual content.
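
One way to avoid the 400 response is to choose the core from the field being queried. The sketch below does this for simple single-field queries only; it does not parse boolean queries, and the host, port, and helper names are assumptions made for illustration.

```python
from urllib.parse import urlencode


def core_for(query):
    """Pick the core from the field name before the first colon."""
    field = query.split(":", 1)[0].strip()
    return "ocr" if field == "ocr" else "meta"


def select_url(host, port, query):
    """Build a select URL that targets the core matching the queried field."""
    return f"http://{host}:{port}/solr/{core_for(query)}/select/?{urlencode({'q': query})}"


# Placeholder host and port (assumptions).
print(select_url("localhost", 8080, "ocr:hathitrust"))
# http://localhost:8080/solr/ocr/select/?q=ocr%3Ahathitrust
```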

Retrieving MARC Records

Users can download MARC records for a set of volume IDs through the HTRC Solr API. The downloaded file is a zip file that contains the MARC records for the specified IDs. Each zip entry is the MARC record for one volume, and the entry's name is the volume ID.
The syntax is :


To specify more than one ID, separate the IDs with "|".

returns a zip file that contains two entries (MARC records), one for miua.2916929.0001.001 and the other for miua.2088345.0001.001.
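
Once the zip file has been downloaded, its entries can be read with Python's zipfile module. The sketch below relies only on the layout described above (one entry per volume, named by volume ID). Because it sends no request, it builds a small stand-in archive in memory; the entry contents are placeholders, not real MARC records, and read_marc_zip is a hypothetical helper name.

```python
import io
import zipfile

volume_ids = ["miua.2916929.0001.001", "miua.2088345.0001.001"]

# IDs are joined with "|" when more than one is requested.
ids_param = "|".join(volume_ids)


def read_marc_zip(data):
    """Map each zip entry name (the volume ID) to that entry's raw bytes."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return {name: archive.read(name) for name in archive.namelist()}


# Stand-in for a downloaded response body (placeholder contents).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    for volume_id in volume_ids:
        archive.writestr(volume_id, b"placeholder MARC record")

records = read_marc_zip(buffer.getvalue())
for volume_id, body in records.items():
    print(volume_id, len(body), "bytes")
```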

More examples

The MARC records in the HT Digital Library contain a field for the Library of Congress (LOC) call number. HTRC indexes it, so users can search by LOC call number. The field is called "callnumber", as shown in the table above. A query for volumes by LOC call number searches this field for a value such as "E181.B77".

This retrieves a volume that has the LOC call number E181.B77.

