
The HTRC Solr Proxy is a thin layer over a Solr service that limits access, audits requests, and provides additional functionality. It runs at http://chinkapin.pti.indiana.edu:9994, which the examples in this guide use.

The HTRC Solr Proxy has two cores: one for OCR full-text search and one for metadata search, which covers all MARC bibliographic fields plus HTRC-contributed fields (field names with the prefix "htrc_"). The ocr core lets users issue a query against the ocr field and get back a list of matching volume IDs. The metadata core supports a wider variety of queries over different fields. Before introducing the queries, the table below lists the frequently used fields of the HTRC metadata Solr core and how they are processed. Indexed fields are searchable, whereas the content of stored fields can be returned to users.
 

For metadata solr core: 

| Field name | Indexed | Stored | Explanation |
|---|---|---|---|
| id | Y | Y | Volume ID |
| fullrecord | N | Y | The full MARC record |
| title | Y | Y | Book title |
| author | Y | Y | Author of a volume |
| oclc | Y | Y | The OCLC Control Number |
| rptnum | Y | Y | |
| sdrnum | Y | Y | |
| isbn | Y | Y | The International Standard Book Number (ISBN) for a volume |
| issn | Y | Y | The International Standard Serial Number (ISSN) for a volume |
| callnumber | Y | Y | The call number for a volume |
| sudoc | Y | Y | Superintendent of Documents number for a volume. Please refer to http://www.fdlp.gov/cataloging/856-sudoc-classification-scheme?start=3 |
| language | Y | N | The languages in which a volume is written |
| htsource | Y | Y | The sources of a volume, e.g. "Indiana University" |
| era | Y | N | The era of the volume |
| geographic | Y | N | Brief geographic information for a volume, e.g. "pennsylvania" |
| country_of_pub | Y | N | The publication country of the book |
| topic | Y | N | Topic of the volume |
| genre | Y | N | Genre the volume belongs to |
| publishDate | Y | Y | Publish date of the volume |
| publisher | Y | Y | Publisher of the volume |
| edition | Y | Y | Edition of the volume |
| allfields | N | Y | All bibliographic fields above, except the OCR field, are indexed in this field. This is the default search field. |
| htrc_pageCount | Y | Y | The page count of the volume |
| htrc_wordCount | Y | Y | The word count of the volume (HTRC uses the Lucene 3.6 standard tokenizer to split words) |
| htrc_charCount | Y | Y | The character count of the volume |
| htrc_gender | Y | Y | Gender of the authors of this volume. This can be male, female, or both. |
| htrc_genderMale / htrc_genderFemale / htrc_genderUnknown | Y | Y | Male, female, and gender-unknown authors of this volume; multivalued |
| htrc_volumePageCountBin | Y | Y | Quartile info for this volume based on its page count. Values are S/M/L/XL. |
| htrc_volumeWordCountBin | Y | Y | Quartile info for this volume based on its word count. Values are S/M/L/XL. |

 

Note that some fields are indexed but not stored. These fields do not appear in search results but are very useful for faceted search, which HTRC Blacklight relies on heavily.
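The indexed/stored distinction can be checked on the client side before a request is sent. The following is a minimal sketch, not part of the proxy: the FIELDS dictionary copies a few rows from the table above, and check_request is a hypothetical helper name introduced only for illustration.

```python
# (indexed, stored) flags for a few fields, copied from the table above.
FIELDS = {
    "id": (True, True),
    "fullrecord": (False, True),
    "title": (True, True),
    "language": (True, False),
    "genre": (True, False),
    "publishDate": (True, True),
}

def check_request(search_fields, return_fields):
    """Return a list of problems: searching needs an indexed field,
    returning content needs a stored field."""
    problems = []
    for name in search_fields:
        indexed, _ = FIELDS[name]
        if not indexed:
            problems.append(f"{name} is not indexed, so it cannot be searched")
    for name in return_fields:
        _, stored = FIELDS[name]
        if not stored:
            problems.append(f"{name} is not stored, so it cannot be returned")
    return problems

# Searching on genre and returning id and title is fine.
print(check_request(["genre"], ["id", "title"]))
# Searching fullrecord or returning language is flagged.
print(check_request(["fullrecord"], ["language"]))
```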

The ocr core also has many of these metadata fields, but users are discouraged from querying metadata fields on the ocr core. The metadata fields in the ocr core are stale and will not be maintained. In fact, HTRC will remove all metadata fields from the ocr core so that it has only two fields, id and ocr; all metadata queries should go to the metadata core.

 

1.1. HTRC Solr basic queries

All basic Solr queries are allowed; "update" operations are banned. Detailed instructions can be found at http://wiki.apache.org/solr/SolrQuerySyntax and http://wiki.apache.org/solr/CommonQueryParameters.

Because HTRC has very large index files, distributed search is used to make use of more system resources. Previously, "qt=sharding" had to be appended to the REST call to make sure that the query was sent to all shards. Users no longer need to do this because the default "qt" is now "sharding". Users can still specify a query type explicitly to override the default "qt" parameter. The following are the most frequently used queries, which are sufficient for most uses:

(1) Basic term query, in the pattern field:term

For example, title:war returns the volume IDs of all volumes whose title field contains the word "war".

(2) Boolean query: simple concatenation of two queries with "AND" or "OR"

Example: title:war AND author:Hill returns the IDs of volumes that have "war" in the title and were written by an author named "Hill".

(3) Numeric range query: HTRC Solr has only one numeric field, publishDate.

Example: publishDate:[1990 TO 1999] returns the IDs of all volumes whose publishDate is between 1990 and 1999, inclusive of both 1990 and 1999.

(4) Prefix query: "*" matches zero or more characters

Example: title:chil* returns the IDs of all volumes whose title field contains a word starting with "chil". To get all the volumes in the index, use "*:*".
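The four query patterns above can be composed programmatically. The sketch below uses standard Solr query syntax; the helper names (term, boolean, inclusive_range, prefix) are hypothetical and exist only for this illustration.

```python
def term(field, value):
    """Basic term query, e.g. title:war."""
    return f"{field}:{value}"

def boolean(op, *clauses):
    """Join clauses with AND or OR."""
    if op not in ("AND", "OR"):
        raise ValueError("op must be 'AND' or 'OR'")
    return f" {op} ".join(clauses)

def inclusive_range(field, low, high):
    """Range query; square brackets include both endpoints."""
    return f"{field}:[{low} TO {high}]"

def prefix(field, stem):
    """Prefix query; '*' matches zero or more characters."""
    return f"{field}:{stem}*"

queries = [
    term("title", "war"),
    boolean("AND", term("title", "war"), term("author", "Hill")),
    inclusive_range("publishDate", 1990, 1999),
    prefix("title", "chil"),
]
for q in queries:
    print(q)
```

These strings are the value of the q parameter only; they still need to be URL-encoded before being placed in a request.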

To issue a basic query to the HTRC Solr Proxy metadata core, send the query as the "q" parameter of a RESTful request to the metadata core.

Example: a request with q=title:war returns the IDs of volumes that have "war" in the title.
 
You can also set the number of results returned with the "rows" parameter.

Example: adding rows=5 to a request for title:war returns the top 5 hits among the volumes that have "war" in the title.
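A request URL with "q" and "rows" can be assembled as in the sketch below. The host and port come from this guide, but the path /solr/meta/select/ is an assumption about how the metadata core is exposed, and metadata_query_url is a hypothetical helper; check both against the running service before relying on them.

```python
from urllib.parse import urlencode

PROXY = "http://chinkapin.pti.indiana.edu:9994"
# ASSUMPTION: path of the metadata core's select handler; not confirmed by this guide.
META_SELECT_PATH = "/solr/meta/select/"

def metadata_query_url(q, rows=None):
    """Build a GET URL for a metadata-core query, optionally limiting hits."""
    params = {"q": q}
    if rows is not None:
        params["rows"] = rows
    # urlencode percent-escapes ':' and turns spaces into '+'.
    return f"{PROXY}{META_SELECT_PATH}?{urlencode(params)}"

print(metadata_query_url("title:war"))
print(metadata_query_url("title:war", rows=5))
```

The resulting URL can be opened in a browser or fetched with any HTTP client.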
 
Ranking is done by Solr itself. Apache Solr's scoring and ranking mechanism combines the Boolean Model (BM) and the Vector Space Model (VSM) of information retrieval: documents "approved" by the BM are scored by the VSM.
 
Solr query parameters are very flexible. For more details, please refer to http://wiki.apache.org/solr/CommonQueryParameters.
 

1.2. Facet Query

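A facet query adds the standard Solr facet parameters to a query: facet=on turns faceting on, and facet.field names the field to facet on.

Example: the query *:* with facet=on and facet.field=genre returns all volume IDs in the index; at the bottom of the result set, all genres are listed with the number of volumes in the search result that belong to each genre.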
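The facet counts at the bottom of the result set can be read as in the following sketch. It assumes the proxy passes through Solr's wt=json response format, in which each faceted field is a flat list alternating value and count. The sample response is invented for illustration, and facet_params and facet_counts are hypothetical helper names.

```python
from urllib.parse import urlencode

def facet_params(q, field):
    """Query-string parameters for a facet query on one field.
    ASSUMPTION: the proxy accepts wt=json like a plain Solr server."""
    return urlencode({"q": q, "facet": "on", "facet.field": field, "wt": "json"})

def facet_counts(response, field):
    """Turn Solr's flat [value, count, value, count, ...] list into a dict."""
    flat = response["facet_counts"]["facet_fields"][field]
    return dict(zip(flat[0::2], flat[1::2]))

# Invented sample shaped like a Solr JSON response; not real HTRC data.
sample = {
    "response": {"numFound": 15, "docs": []},
    "facet_counts": {"facet_fields": {"genre": ["fiction", 10, "poetry", 5]}},
}

print(facet_params("*:*", "genre"))
print(facet_counts(sample, "genre"))
```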

 
For more details of facet parameters, please refer to http://wiki.apache.org/solr/SimpleFacetParameters .


1.3. HTRC Solr full text search

Users can run full-text searches on the ocr field through the Solr ocr core. The ocr core supports basic, Boolean, phrase, and wildcard queries. When doing an ocr full-text search, remember to send the request to the ocr core rather than the metadata core. Otherwise, a 400 response code will be returned because there is no ocr field in the metadata core.

A quick example of ocr full-text search: the query ocr:hathitrust, sent to the ocr core, returns the volumes that have "hathitrust" in their textual content.
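One way to avoid the 400 response is to route each query to the right core based on the fields it mentions. The sketch below does that with a simple pattern match. Both core paths are assumptions about how the proxy exposes its cores, and core_for and select_url are hypothetical helper names.

```python
import re
from urllib.parse import urlencode

PROXY = "http://chinkapin.pti.indiana.edu:9994"
# ASSUMPTION: select-handler paths for the two cores; not confirmed by this guide.
CORE_PATHS = {"meta": "/solr/meta/select/", "ocr": "/solr/ocr/select/"}

def core_for(q):
    """Pick the ocr core when the query addresses the ocr field, else meta."""
    return "ocr" if re.search(r"\bocr:", q) else "meta"

def select_url(q):
    """Build a query URL against whichever core the query needs."""
    return f"{PROXY}{CORE_PATHS[core_for(q)]}?{urlencode({'q': q})}"

print(select_url("ocr:hathitrust"))
print(select_url("title:war"))
```

The pattern match is deliberately simple: a query that mixes the ocr field with metadata fields is routed to the ocr core, which this guide discourages.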

 

1.4. Getting MARC Records

Users can download the MARC records for a set of volume IDs through the HTRC Solr API. The download is a zip file that contains the MARC records for the specified IDs. Each entry in the zip file is the MARC record for one volume, and the entry's name is the volume ID.
 
The request takes a list of volume IDs. To specify more than one ID, separate the IDs with "|".

Example: a request for miua.2916929.0001.001|miua.2088345.0001.001 returns a zip file that contains two entries (MARC records), one for miua.2916929.0001.001 and the other for miua.2088345.0001.001.
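The downloaded archive can be unpacked as in the sketch below. It covers only the two parts this guide states: IDs are joined with "|", and each zip entry is named after its volume ID. The request URL is left out, and the archive is a stand-in built in memory with placeholder content instead of real MARC records. join_ids and read_marc_zip are hypothetical helper names.

```python
import io
import zipfile

def join_ids(volume_ids):
    """Join volume IDs with '|' as the request expects."""
    return "|".join(volume_ids)

def read_marc_zip(data):
    """Map each zip entry name (the volume ID) to its raw record bytes."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return {name: archive.read(name) for name in archive.namelist()}

ids = ["miua.2916929.0001.001", "miua.2088345.0001.001"]
print(join_ids(ids))

# Stand-in for a downloaded archive; the record bodies are placeholders.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    for volume_id in ids:
        archive.writestr(volume_id, f"placeholder record for {volume_id}")

records = read_marc_zip(buffer.getvalue())
for volume_id, record in records.items():
    print(volume_id, len(record), "bytes")
```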
