Sorting, Paging, and Deep Paging in Solr



Basic Sorting

First, let’s add 12 documents (in this case, metadata about books) to Solr in CSV (comma-separated values) format:

$ curl 'http://localhost:8983/solr/update?commitWithin=5000' -H 'Content-type:text/csv' -d '
id,cat,pubyear_i,title,author,series_s,sequence_i
book1,fantasy,2000,A Storm of Swords,George R.R. Martin,A Song of Ice and Fire,3
book2,fantasy,2005,A Feast for Crows,George R.R. Martin,A Song of Ice and Fire,4
book3,fantasy,2011,A Dance with Dragons,George R.R. Martin,A Song of Ice and Fire,5
book4,sci-fi,1987,Consider Phlebas,Iain M. Banks,The Culture,1
book5,sci-fi,1988,The Player of Games,Iain M. Banks,The Culture,2
book6,sci-fi,1990,Use of Weapons,Iain M. Banks,The Culture,3
book7,fantasy,1984,Shadows Linger,Glen Cook,The Black Company,2
book8,fantasy,1984,The White Rose,Glen Cook,The Black Company,3
book9,fantasy,1989,Shadow Games,Glen Cook,The Black Company,4
book10,sci-fi,2001,Gridlinked,Neal Asher,Ian Cormac,1
book11,sci-fi,2003,The Line of Polity,Neal Asher,Ian Cormac,2
book12,sci-fi,2005,Brass Man,Neal Asher,Ian Cormac,3
'

 
Now we can issue a query with the following parameters:
q=id:book* – Matches document ids that start with “book”.
sort=pubyear_i desc – Sorts matches in descending order by the year of publication.
fl=title,pubyear_i – “fl” stands for “field list”: the stored fields to return for the resulting matches.
 
(If Solr is up and running on the same machine as your browser, you can open the request below directly. It is wrapped across lines here for readability.)

http://localhost:8983/solr/query?
   q=id:book*
   &sort=pubyear_i desc
   &fl=title,pubyear_i
{
  "responseHeader":{
    "status":0,
    "QTime":2,
    "params":{
      "fl":"title,pubyear_i",
      "sort":"pubyear_i desc",
      "q":"id:book*"}},
  "response":{"numFound":12,"start":0,"docs":[
      {
        "pubyear_i":2011,
        "title":["A Dance with Dragons"]},
      {
        "pubyear_i":2005,
        "title":["A Feast for Crows"]},
      {
        "pubyear_i":2005,
        "title":["Brass Man"]},
      {
        "pubyear_i":2003,
        "title":["The Line of Polity"]},
      {
        "pubyear_i":2001,
        "title":["Gridlinked"]},
      {
        "pubyear_i":2000,
        "title":["A Storm of Swords"]},
      {
        "pubyear_i":1990,
        "title":["Use of Weapons"]},
      {
        "pubyear_i":1989,
        "title":["Shadow Games"]},
      {
        "pubyear_i":1988,
        "title":["The Player of Games"]},
      {
        "pubyear_i":1987,
        "title":["Consider Phlebas"]}]
  }}

Note that we found 12 books (see numFound in the response above) but there are only 10 books in the response. This is because the rows parameter defaults to 10.
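
The request above is shown with spaces and line breaks for readability, so a client has to URL-encode it before sending. Here is a minimal sketch using only Python's standard library. The helper name build_query_url is hypothetical, and the base URL is assumed to be the same localhost address used in the examples.

```python
from urllib.parse import urlencode

# Assumed base URL, taken from the examples in this article.
SOLR_QUERY_URL = "http://localhost:8983/solr/query"


def build_query_url(params, base=SOLR_QUERY_URL):
    """Return a URL-encoded Solr query URL (hypothetical helper)."""
    return base + "?" + urlencode(params)


url = build_query_url({
    "q": "id:book*",
    "sort": "pubyear_i desc",   # the space is encoded as "+"
    "fl": "title,pubyear_i",
    "rows": 12,                 # above the default of 10, to get all 12 books
})
print(url)
```

Setting rows=12 here is only one way to see every match; the paging parameters described next are the more general tool.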

 

Basic Paging

There are two parameters that control paging:
start – The starting offset into the ranked (sorted) list of documents. Defaults to 0.
rows – The maximum number of documents to return. Defaults to 10.

For example, if we add start=3 and rows=2 to the example query above, we should get the 4th and 5th books in the ranked document list.

http://localhost:8983/solr/query?
   q=id:book*
   &sort=pubyear_i desc
   &fl=title,pubyear_i
   &start=3
   &rows=2
  {"response":{"numFound":12,"start":3,"docs":[
      {
        "pubyear_i":2003,
        "title":["The Line of Polity"]},
      {
        "pubyear_i":2001,
        "title":["Gridlinked"]}]
  }}
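
To make the start/rows arithmetic concrete, here is a small sketch that maps a 1-based page number to these parameters and replays the example against an in-memory copy of the 12 books. The helper names page_params and simulate_page are hypothetical, and the in-memory sort is only a stand-in for Solr's ranking (ties between equal years may come back in a different order from a real index).

```python
# In-memory stand-in for the indexed books: (id, pubyear_i, title).
BOOKS = [
    ("book1", 2000, "A Storm of Swords"),
    ("book2", 2005, "A Feast for Crows"),
    ("book3", 2011, "A Dance with Dragons"),
    ("book4", 1987, "Consider Phlebas"),
    ("book5", 1988, "The Player of Games"),
    ("book6", 1990, "Use of Weapons"),
    ("book7", 1984, "Shadows Linger"),
    ("book8", 1984, "The White Rose"),
    ("book9", 1989, "Shadow Games"),
    ("book10", 2001, "Gridlinked"),
    ("book11", 2003, "The Line of Polity"),
    ("book12", 2005, "Brass Man"),
]


def page_params(page, rows=10):
    """Translate a 1-based page number into start/rows parameters."""
    if page < 1 or rows < 1:
        raise ValueError("page and rows must be positive")
    return {"start": (page - 1) * rows, "rows": rows}


def simulate_page(docs, start, rows):
    """Rank by pubyear_i descending, then slice like start/rows would."""
    ranked = sorted(docs, key=lambda d: d[1], reverse=True)
    return ranked[start:start + rows]


for _id, year, title in simulate_page(BOOKS, start=3, rows=2):
    print(year, title)
```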

 

Deep Paging

Deep paging refers to specifying a large start offset into the search results.
Basic paging can be inefficient with large start values. To return documents 1,000,001 through 1,000,010 of a sorted document list (only 10 documents), the search engine must find the top 1,000,010 documents and then take the last 10 to return to the user. Solr is smart enough to retrieve the stored fields for only the final 10 documents, but there is still the overhead of sorting the internal ids of the top 1,000,010 documents.

Deep paging via basic paging controls is even more inefficient for distributed searches (SolrCloud) since the sort values for the first 1,000,010 documents from each shard need to be returned and merged at an aggregator node in order to find the correct 10.
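
The cost described above is easy to put into numbers. The following is a back-of-the-envelope sketch, not a measurement: the function name sort_entries_needed is hypothetical, and it only counts how many sort entries have to be ranked and merged for one page, assuming every shard contributes its own top start + rows candidates.

```python
def sort_entries_needed(start, rows, shards=1):
    """Rough count of sort entries ranked to serve one start/rows page.

    Each shard ranks its own top (start + rows) candidates; with more
    than one shard, the aggregator node merges all of them.
    """
    per_shard = start + rows
    return per_shard * shards


print(sort_entries_needed(0, 10))                    # first page, one node
print(sort_entries_needed(1_000_000, 10))            # deep page, one node
print(sort_entries_needed(1_000_000, 10, shards=4))  # deep page, 4 shards
```

With start=1,000,000 and rows=10, a single node ranks 1,000,010 entries, and a hypothetical 4-shard collection merges 4,000,040 entries to return the same 10 documents.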

Deep Paging with a Cursor

The cursorMark parameter allows efficient iteration over a large result set. It works both on a single node and for distributed searches in SolrCloud mode.
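
The idea behind a cursor can be illustrated without Solr. The sketch below is conceptual only: it pages through an in-memory list by remembering the sort key of the last document returned and resuming strictly after it, instead of counting an offset from the top. The names sort_key and cursor_page are hypothetical, and the tuple used as the mark is an assumption for illustration; the real cursorMark is an opaque string that clients should not try to interpret.

```python
# In-memory stand-in for the indexed books: (id, pubyear_i, title).
BOOKS = [
    ("book1", 2000, "A Storm of Swords"),
    ("book2", 2005, "A Feast for Crows"),
    ("book3", 2011, "A Dance with Dragons"),
    ("book4", 1987, "Consider Phlebas"),
    ("book5", 1988, "The Player of Games"),
    ("book6", 1990, "Use of Weapons"),
    ("book7", 1984, "Shadows Linger"),
    ("book8", 1984, "The White Rose"),
    ("book9", 1989, "Shadow Games"),
    ("book10", 2001, "Gridlinked"),
    ("book11", 2003, "The Line of Polity"),
    ("book12", 2005, "Brass Man"),
]


def sort_key(doc):
    """Total order: pubyear_i descending, then id ascending as tie-breaker."""
    doc_id, year, _title = doc
    return (-year, doc_id)


def cursor_page(docs, rows, after=None):
    """Return (page, next_mark): up to `rows` docs strictly after `after`."""
    ranked = sorted(docs, key=sort_key)
    if after is not None:
        ranked = [d for d in ranked if sort_key(d) > after]
    page = ranked[:rows]
    # With no documents left, hand back the same mark that was sent.
    next_mark = sort_key(page[-1]) if page else after
    return page, next_mark


sizes, seen_ids, mark = [], [], None
while True:
    page, next_mark = cursor_page(BOOKS, rows=5, after=mark)
    sizes.append(len(page))
    seen_ids.extend(d[0] for d in page)
    if next_mark == mark:   # mark did not advance: nothing left
        break
    mark = next_mark
print(sizes)
```

Because the id tie-breaker makes the sort key unique per document, resuming "after the last key" never skips or repeats a document. This toy version re-sorts the whole list on every call; it shows the semantics, not the efficiency.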

Using cursorMark:

  • sort must include a tie-breaker sort on the id field (the uniqueKey field). This prevents tie-breaking by internal Lucene document id, which can change.
  • start must be 0 for all calls including a cursorMark.
  • pass cursorMark=* for the first request.
  • Solr will return a nextCursorMark in the response. Simply use this value for cursorMark on the next call to continue paging through the results.
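
The rules above can be checked on the client before a request is sent. This is a minimal sketch with a hypothetical helper, cursor_params; it assumes the unique key field is named id, as in this article's examples, and it leaves start out entirely so that it keeps its default of 0.

```python
def cursor_params(q, sort, rows, cursor_mark="*", unique_key="id"):
    """Build parameters for a cursor request (hypothetical helper).

    Raises ValueError if the sort has no tie-breaker on the unique key.
    """
    # "pubyear_i desc, id asc" -> ["pubyear_i", "id"]
    sort_fields = [c.split()[0] for c in sort.split(",") if c.strip()]
    if unique_key not in sort_fields:
        raise ValueError(
            "sort must include a tie-breaker on %r" % unique_key)
    # start is deliberately omitted: it must stay 0 with cursorMark.
    return {"q": q, "sort": sort, "rows": rows, "cursorMark": cursor_mark}


params = cursor_params("id:book*", "pubyear_i desc, id asc", rows=5)
print(params)
```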

 
First request:

http://localhost:8983/solr/query?
   q=id:book*
   &sort=pubyear_i desc, id asc
   &fl=title,pubyear_i
   &rows=5
   &cursorMark=*
  {"response":{"numFound":12,"start":0,"docs":[
      {
        "pubyear_i":2011,
        "title":["A Dance with Dragons"]},
      {
        "pubyear_i":2005,
        "title":["A Feast for Crows"]},
      {
        "pubyear_i":2005,
        "title":["Brass Man"]},
      {
        "pubyear_i":2003,
        "title":["The Line of Polity"]},
      {
        "pubyear_i":2001,
        "title":["Gridlinked"]}]
  },
  "nextCursorMark":"AoJRfSVib29rNw=="}

 
For the next request, simply set cursorMark to the value we received for nextCursorMark.

http://localhost:8983/solr/query?
   q=id:book*
   &sort=pubyear_i desc, id asc
   &fl=title,pubyear_i
   &rows=5
   &cursorMark=AoJRfSVib29rNw==
  {"response":{"numFound":12,"start":0,"docs":[
      {
        "pubyear_i":2000,
        "title":["A Storm of Swords"]},
      {
        "pubyear_i":1990,
        "title":["Use of Weapons"]},
      {
        "pubyear_i":1989,
        "title":["Shadow Games"]},
      {
        "pubyear_i":1988,
        "title":["The Player of Games"]},
      {
        "pubyear_i":1987,
        "title":["Consider Phlebas"]}]
  },
  "nextCursorMark":"AoJTfCVib29rNA=="}

 
And our final request, again setting cursorMark to the new value we received for nextCursorMark in the response.

http://localhost:8983/solr/query?
   q=id:book*
   &sort=pubyear_i desc, id asc
   &fl=title,pubyear_i
   &rows=5
   &cursorMark=AoJTfCVib29rNA==
  {"response":{"numFound":12,"start":0,"docs":[
      {
        "pubyear_i":1984,
        "title":["Shadows Linger"]},
      {
        "pubyear_i":1984,
        "title":["The White Rose"]}]
  },
  "nextCursorMark":"AoJQfCZib29rMTE="}

Deep Paging cursorMark implementation notes

  • The cursorMark parameter itself contains all the necessary state. There is no server-side state.
  • The start parameter returned is always 0. It’s up to the client to figure out (or remember) what the position is for display purposes.
  • There is no need to page to the end of the result set with cursorMark (since there is no server-side state kept). Stop wherever you want.
  • You know you have reached the end of a result set when you do not get back the full number of rows requested, or when the nextCursorMark returned is the same as the cursorMark you sent (at which point, no documents will be in the returned list).
  • Although start must always be 0, you can vary the number of rows for every call to vary the page size.
  • You can re-use cursorMark values, changing other things like what stored fields are returned or what fields are faceted.
  • A client can efficiently go back pages by remembering previous cursorMarks and re-submitting them.
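
The end-of-results rule from the notes above can be turned into a small client loop. This is a sketch under stated assumptions rather than a definitive client: fetch_json and iter_cursor are hypothetical names, the URL is the localhost address used throughout this article, and the fetch function is passed in as a parameter so the loop can be exercised without a running Solr.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint, as in the examples in this article.
SOLR_QUERY_URL = "http://localhost:8983/solr/query"


def fetch_json(params):
    """Send one query to Solr and decode the JSON response."""
    with urlopen(SOLR_QUERY_URL + "?" + urlencode(params)) as resp:
        return json.load(resp)


def iter_cursor(params, fetch=fetch_json):
    """Yield every matching document by following nextCursorMark."""
    params = dict(params, cursorMark="*")   # first request uses "*"
    while True:
        result = fetch(params)
        yield from result["response"]["docs"]
        next_mark = result["nextCursorMark"]
        if next_mark == params["cursorMark"]:
            break                            # mark unchanged: end of results
        params["cursorMark"] = next_mark
```

Against a running Solr, calling iter_cursor with q=id:book*, sort=pubyear_i desc, id asc, fl=title,pubyear_i and rows=5 would walk all 12 books, five at a time. Stopping on an unchanged mark costs one extra (empty) request compared with stopping on a short page, but it is the simpler rule to get right.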