The HTRC Solr Proxy is a thin layer over a Solr service that limits access, audits requests, and provides additional functionality. It runs at http://chinkapin.pti.indiana.edu:9994, which we use in our examples.
The HTRC Solr Proxy has two cores: one for OCR full-text search and one for metadata search, which covers all MARC bibliographic fields plus HTRC-contributed fields (field names with the prefix "htrc_"). The OCR core lets users query the ocr field and get back a list of matching volume IDs. The metadata core supports a wider variety of queries across different fields. Before introducing the queries, the table below lists the frequently used fields of the HTRC metadata Solr core and how they are processed. Indexed fields are searchable, whereas the content of stored fields can be returned to users.
The table below shows the Solr metadata core fields:
|Field||Indexed||Stored||Description|
|id||Y||Y||Field for volume ID|
|fullrecord||N||Y||Field for storing the full marc record|
|title||Y||Y||Field for book title|
|author||Y||Y||Author for a volume|
|oclc||Y||Y||The OCLC Control Number|
|isbn||Y||Y||The International Standard Book Number (ISBN) for a volume|
|issn||Y||Y||The International Standard Serial Number (ISSN) for a volume|
|callnumber||Y||Y||The call number for a volume|
|sudoc||Y||Y||Superintendent of Documents number for a volume. Please refer to http://www.fdlp.gov/cataloging/856-sudoc-classification-scheme?start=3|
|language||Y||N||Field for the languages in which a volume is written|
|htsource||Y||Y||Field for the sources of a volume, e.g. "Indiana University"|
|era||Y||N||The era of the volume|
|geographic||Y||N||Brief geographic information for a volume, e.g. "pennsylvania"|
|country_of_pub||Y||N||The publication country of the book|
|topic||Y||N||Topic of the volume|
|genre||Y||N||Genre the volume belongs to|
|publishDate||Y||Y||Publish date of the volume|
|publisher||Y||Y||Publisher of the volume|
|edition||Y||Y||Edition of the volume|
|allfields||N||Y||All bibliographic fields above, excluding the OCR field, are indexed in this field. This is the default search field.|
|htrc_pageCount||Y||Y||The page count of the volume|
|htrc_wordCount||Y||Y||The word count of the volume (HTRC uses the Lucene 3.6 standard tokenizer to split words)|
|htrc_charCount||Y||Y||The character count of the volume|
Gender of the authors of this volume. This can be either male or female or both.
Male/female/gender-unknown authors of this volume, multivalued
|htrc_volumePageCountBin||Y||Y||Quartile bin of this volume based on its page count. Values are S/M/L/XL|
|htrc_volumeWordCountBin||Y||Y||Quartile bin of this volume based on its word count. Values are S/M/L/XL|
Some fields are indexed but not stored. These fields do not appear in search results but are very useful for faceted search, which HTRC Blacklight relies on heavily.
The OCR core also contains many of these metadata fields, but users are discouraged from querying metadata fields on the OCR core. Those copies are stale and will not be maintained. HTRC plans to remove all metadata fields from the OCR core so that it has only two fields, id and ocr; all metadata queries should go to the metadata core.
All the basic queries are allowed. "Update" operations are banned. Detailed instructions can be found at http://wiki.apache.org/solr/SolrQuerySyntax and http://wiki.apache.org/solr/CommonQueryParameters.
Because HTRC has very large index files, distributed search is used to make use of more system resources. Previously, "qt=sharding" had to be appended to the REST call to make sure the query was sent to all shards. This is no longer necessary because the default "qt" is now "sharding". Users can still specify a query type explicitly to override the default "qt" parameter. What follows are instructions for the most frequently used queries; these are sufficient for most uses:
Basic term query pattern
For example, "title: war" returns the volume IDs of all volumes whose title field contains the word "war".
Simple concatenation of two queries by "AND" or "OR":
For example, "title:war AND author:Hill" returns the volume IDs of volumes that have "war" in the title and are written by an author named "Hill".
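As a minimal sketch, the term and boolean patterns above can be assembled as plain strings before being sent as the "q" parameter. The helper names term_query and combine are hypothetical, introduced here only for illustration; they are not part of any HTRC or Solr API.

```python
def term_query(field, term):
    """Build a single-field term query such as title:war."""
    return f"{field}:{term}"


def combine(operator, *clauses):
    """Join clauses with AND or OR, wrapping each clause in parentheses."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    return f" {operator} ".join(f"({clause})" for clause in clauses)


query = combine("AND", term_query("title", "war"), term_query("author", "Hill"))
print(query)  # (title:war) AND (author:Hill)
```

The parentheses are optional for two simple clauses, but they keep the grouping explicit once AND and OR are mixed in one query.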
Numeric range query
HTRC Solr has only one numeric field, publishDate.
For example, "publishDate:[1990 TO 1999]" returns all volume IDs whose publishDate is between 1990 and 1999, inclusive of both 1990 and 1999.
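The range pattern can be sketched as a small string builder. The function name range_query is hypothetical. Square brackets are the standard Solr/Lucene syntax for inclusive bounds and curly braces for exclusive bounds.

```python
def range_query(field, low, high, inclusive=True):
    """Build a Solr range query; [] bounds are inclusive, {} are exclusive."""
    opening, closing = ("[", "]") if inclusive else ("{", "}")
    return f"{field}:{opening}{low} TO {high}{closing}"


print(range_query("publishDate", 1990, 1999))
# publishDate:[1990 TO 1999]
```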
"*" is used here to match zero or more characters.
For example, "title:chil*" returns all volume IDs whose title field contains a word starting with "chil". If you want all the volumes in this index, just use "*:*".
Making a request
To issue a basic query to HTRC Solr Proxy metadata core, the full RESTful request will be:
returns volume IDs that have "war" in the title.
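A minimal sketch of building such a request URL in Python follows. The host and port come from this page, but the select path "/solr/meta/select" is an assumption that is not confirmed by this text, and build_select_url is a hypothetical helper name. Adjust the path to match the actual deployment.

```python
from urllib.parse import urlencode

PROXY = "http://chinkapin.pti.indiana.edu:9994"
# ASSUMPTION: the metadata core's select handler lives at this path.
META_SELECT_PATH = "/solr/meta/select"


def build_select_url(query, path=META_SELECT_PATH, **params):
    """Return a GET URL with the query and any extra parameters URL-encoded."""
    all_params = {"q": query, **params}
    return f"{PROXY}{path}?{urlencode(all_params)}"


url = build_select_url("title:war")
print(url)
# http://chinkapin.pti.indiana.edu:9994/solr/meta/select?q=title%3Awar
```

The resulting URL can be fetched with any HTTP client, for example urllib.request.urlopen(url). Encoding the query with urlencode avoids problems with spaces and special characters in more complex queries.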
Other query parameters
You can also set the number of results you want by setting "rows".
For example, adding "rows=5" to a query for "war" in the title returns the top 5 hits among the volumes that have "war" in the title.
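As a sketch, "rows" can be combined with the standard Solr "start" parameter to page through results. The "start" parameter is one of Solr's common query parameters but is not discussed on this page, and page_params is a hypothetical helper name.

```python
from urllib.parse import urlencode


def page_params(query, page, page_size=5):
    """Encode q, rows and start for a zero-based page number."""
    return urlencode({"q": query, "rows": page_size, "start": page * page_size})


print(page_params("title:war", page=0))  # q=title%3Awar&rows=5&start=0
print(page_params("title:war", page=2))  # q=title%3Awar&rows=5&start=10
```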
Ranking is done by Solr itself. Apache Solr's scoring and ranking mechanism is based on a combination of the Boolean Model (BM) and the Vector Space Model (VSM) of information retrieval: documents "approved" by BM are scored by VSM.
The solr query parameters are very flexible. For more details, please refer to http://wiki.apache.org/solr/CommonQueryParameters .
The facet query syntax is:
returns all volume IDs in the index; at the bottom of the result set, all genres will be listed with the number of volumes belonging to each genre in the search result.
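A hedged sketch of the query string for a facet request follows. It uses the standard Solr parameters "facet" and "facet.field"; the helper name facet_params is hypothetical. A list of pairs is passed to urlencode so that "facet.field" can be repeated for several fields.

```python
from urllib.parse import urlencode


def facet_params(query, *fields):
    """Encode a query plus one facet.field parameter per requested field."""
    pairs = [("q", query), ("facet", "true")]
    pairs.extend(("facet.field", field) for field in fields)
    return urlencode(pairs)


print(facet_params("*:*", "genre"))
# q=%2A%3A%2A&facet=true&facet.field=genre
```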
For more details of facet parameters, please refer to http://wiki.apache.org/solr/SimpleFacetParameters .
Users can do full text search on the ocr field through solr ocr core. Basic query, boolean query, phrase query and wildcard are all supported by ocr core. When doing an ocr full text search, remember to use a slightly different REST call pattern:
Otherwise, a 400 response code will be returned because there is no ocr field in the metadata core.
A quick example of ocr full text search:
returns the volumes that have "hathitrust" in their textual content.
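Since a query on the ocr field must go to the OCR core and every other field to the metadata core, a client can choose the core from the field name. The sketch below assumes the two select paths shown; neither path is confirmed by this text, and select_url is a hypothetical helper name.

```python
from urllib.parse import urlencode

PROXY = "http://chinkapin.pti.indiana.edu:9994"
# ASSUMPTION: illustrative core paths; replace with the real ones.
CORE_PATHS = {"meta": "/solr/meta/select", "ocr": "/solr/ocr/select"}


def select_url(field, term):
    """Send ocr-field queries to the OCR core and everything else to metadata."""
    core = "ocr" if field == "ocr" else "meta"
    query_string = urlencode({"q": f"{field}:{term}"})
    return f"{PROXY}{CORE_PATHS[core]}?{query_string}"


print(select_url("ocr", "hathitrust"))
print(select_url("title", "war"))
```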
Retrieving MARC Records
Users can download the MARC records for a set of volume IDs through the HTRC Solr API. The downloaded file is a zip file that contains the MARC records for the specified IDs. Each zip entry is the MARC record for one volume, and the entry's name is the volume ID.
The syntax is:
returns a zip file that contains two entries (MARC records), one for miua.2916929.0001.001 and the other for miua.2088345.0001.001.
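Once the zip file has been downloaded, its entries can be read with Python's standard zipfile module. The sketch below takes the raw bytes of the response and maps each entry name (the volume ID) to the record content. The function name read_marc_zip is hypothetical, and the demonstration builds a stand-in archive with placeholder content instead of calling the service.

```python
import io
import zipfile


def read_marc_zip(zip_bytes):
    """Map each zip entry name (a volume ID) to its raw record bytes."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return {name: archive.read(name) for name in archive.namelist()}


# Stand-in archive for demonstration only; the entries are NOT real MARC data.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("miua.2916929.0001.001", b"placeholder record")
    archive.writestr("miua.2088345.0001.001", b"placeholder record")

records = read_marc_zip(buffer.getvalue())
print(sorted(records))
# ['miua.2088345.0001.001', 'miua.2916929.0001.001']
```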
The MARC records in the HathiTrust Digital Library contain a field for the Library of Congress (LOC) call number. HTRC indexes it, so users can search by LOC call number. The field is called "callnumber", as shown in the table above. A query for volumes by LOC call number looks like
This retrieves a volume whose LOC call number is E181.B77.
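A sketch of building the call number query string follows. Call numbers sometimes contain spaces, so this version wraps such values in double quotes, which is standard Solr phrase syntax. Whether quoting matches as intended depends on how the callnumber field is analyzed, which this page does not describe, so treat that branch as an assumption. The helper name callnumber_query is hypothetical.

```python
def callnumber_query(call_number):
    """Build a callnumber query, quoting the value if it contains whitespace."""
    if any(ch.isspace() for ch in call_number):
        # Escape backslashes and double quotes before wrapping in quotes.
        escaped = call_number.replace("\\", "\\\\").replace('"', '\\"')
        return f'callnumber:"{escaped}"'
    return f"callnumber:{call_number}"


print(callnumber_query("E181.B77"))  # callnumber:E181.B77
```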