Percentiles for Solr Faceting


The percentile aggregation function was just added to the new Solr Facet Module.
This allows one to calculate one or more percentiles for each facet bucket (i.e. each group of documents produced by faceting), and even sort facet buckets by any given percentile.

The percentile aggregation even works with distributed search!
The algorithm used is Ted Dunnings “t-digest”, which gives good approximations with relatively little memory consumption.

Percentiles Quick Start

NOTE: requires Solr 5.3 or later.

First, let’s start Solr and create a “demo” collection.

$ bin/solr start
$ bin/solr create -c demo
# HINT: use "bin/solr stop -all" when you're finished.

Now, lets index some salary survey data in CSV format, using dynamic fields:

$ curl http://localhost:8983/solr/demo/update?commitWithin=5000 -H 'Content-type:text/csv' -d '
id,gender_s,loc_s,year_i,job_s,salary_d
mark,M,NJ,2011,clerk,21250
john,M,NY,2011,engineer,42500
mary,F,CT,2015,manager,87299
alice,F,NJ,2013,dentist,75000
mike,M,NY,2012,sales,59500
nancy,F,CT,2014,engineer,110000
greg,M,NJ,2012,manager,74000
cindy,F,NJ,2012,engineer,81000
janet,F,NJ,2015,clerk,30150
joe,M,NY,2014,dentist,74000
luke,M,CT,2015,dentist,78000
zoe,F,NY,2013,manager,89500
eli,M,CT,2011,sales,66000
anna,F,CT,2012,sales,59500
evan,M,NY,2014,clerk,2920
'

Percentile Facet Examples

Now we can use Solr’s analytics / facet functions to slice and dice our data!

Let’s say we want the 25%, 50%, and 75% percentile salaries across all our jobs:

$ curl http://localhost:8983/solr/demo/query -d 'q=*:*&json.facet={salary_percentiles:"percentile(salary_d,25,50,75)"}'

And at the end of our response, we’ll get our facet results:

[...]
  "facets" : {
    "count" : 15,
    "salary_percentiles" : [51000.0, 74000.0, 79500.0]
  }
}

 
We can add in other statistics such as the average salary, the number of different jobs, and the number of different states in our salary survey:

$ curl http://localhost:8983/solr/demo/query -d 'q=*:*&
json.facet={
  average_salary : "avg(salary_d)",
  num_jobs : "unique(job_s)",
  num_states : "unique(loc_s)",
  salary_percentiles : "percentile(salary_d,25,50,75)"
}'
  "facets":{
    "count":15,
    "average_salary":63374.6,
    "num_jobs":5,
    "num_states":3,
    "salary_percentiles":[51000.0,74000.0,79500.0]
  }

 
Now let’s take a look at median salary broken out by gender:

$ curl http://localhost:8983/solr/demo/query -d 'q=*:*&
json.facet={
  by_gender:{
    type:terms
    field:gender_s,
    facet:{
      median_salary:"percentile(salary_d,50)"
    }
  }
}'
  "facets":{
    "count":15,
    "by_gender":{
      "buckets":[
        {
          "val":"M",
          "count":8,
          "median_salary":62750.0
        },
        {
          "val":"F",
          "count":7,
          "median_salary":81000.0
        }
      ]
    }
  }

 
We can also sort by a percentile statistic. If you request more than one percentile value, the sort will be on the first value in the list requested. Let’s find the top states by 99.9th percentile salary:

$ curl http://localhost:8983/solr/demo/query -d 'q=*:*&
json.facet={
  rich_states:{
    type : terms,
    field : loc_s,
    sort : {sal:desc},  // specifying the sort as a string, like sort:"sal desc" will also work
    facet : {
      sal : "percentile(salary_d,99.9)"
    }
  }
}'
  "facets":{
    "count":15,
    "rich_states":{
      "buckets":[{
          "val":"CT",
          "count":5,
          "sal":109909.19600000001},
        {
          "val":"NY",
          "count":5,
          "sal":89438.00000000001},
        {
          "val":"NJ",
          "count":5,
          "sal":80976.0}]}}

 
We can get even more interesting by nesting facets. How about finding the highest earning occupation (99.9th percentile) for every state?

$ curl http://localhost:8983/solr/demo/query -d 'q=*:*&
json.facet={
  states:{
    type:terms,
    field:loc_s,
    facet:{
      top_jobs:{ // nested terms facet
        type : terms,
        field : job_s,
        sort : "sal desc", // sort will be on first percentile (99.9)
        limit : 1,         // only show top occupation
        facet:{
          sal : "percentile(salary_d,99.9,50,10)"
        }
      }
    } // end facet block for the loc_s field
  }
}'

The response has been omitted since we don’t have enough data for it to be interesting.

 
We can also show how median salary has changed over time for each individual state:

$ curl http://localhost:8983/solr/demo/query -d 'q=*:*&
json.facet={
  states:{
    type:terms,
    field:loc_s,
    facet:{
      over_time:{  // nested range facet
        type : range,
        field : year_i,
        start : 2011,
        end : 2015,
        gap : 1,
        facet:{
          median_salary : "percentile(salary_d,50)"
        }
      }
    } // end facet block for the loc_s field
  }
}'
  "facets":{
    "count":15,
    "states":{
      "buckets":[{
          "val":"CT",
          "count":5,
          "over_time":{
            "buckets":[{
                "val":2011,
                "count":1,
                "median_salary":66000.0},
              {
                "val":2012,
                "count":1,
                "median_salary":59500.0},
              {
                "val":2013,
                "count":0},
              {
                "val":2014,
                "count":1,
                "median_salary":110000.0}]}},
        {
          "val":"NJ",
          "count":5,
          "over_time":{
            "buckets":[{
                "val":2011,
                "count":1,
                "median_salary":21250.0},
              {
                "val":2012,
                "count":2,
                "median_salary":77500.0},
              {
                "val":2013,
                "count":1,
                "median_salary":75000.0},
              {
                "val":2014,
                "count":0}]}},
        {
          "val":"NY",
          "count":5,
          "over_time":{
            "buckets":[{
                "val":2011,
                "count":1,
                "median_salary":42500.0},
              {
                "val":2012,
                "count":1,
                "median_salary":59500.0},
              {
                "val":2013,
                "count":1,
                "median_salary":89500.0},
              {
                "val":2014,
                "count":2,
                "median_salary":38460.0}]}}]}}