The percentile
aggregation function was just added to the new Solr Facet Module.
This allows one to calculate one or more percentiles for each facet bucket (i.e. each group of documents produced by faceting), and even sort facet buckets by any given percentile.
The percentile
aggregation even works with distributed search!
The algorithm used is Ted Dunnings “t-digest”, which gives good approximations with relatively little memory consumption.
Percentiles Quick Start
NOTE: requires Solr 5.3 or later.
First, let’s start Solr and create a “demo” collection.
$ bin/solr start $ bin/solr create -c demo # HINT: use "bin/solr stop -all" when you're finished.
Now, lets index some salary survey data in CSV format, using dynamic fields:
$ curl http://localhost:8983/solr/demo/update?commitWithin=5000 -H 'Content-type:text/csv' -d ' id,gender_s,loc_s,year_i,job_s,salary_d mark,M,NJ,2011,clerk,21250 john,M,NY,2011,engineer,42500 mary,F,CT,2015,manager,87299 alice,F,NJ,2013,dentist,75000 mike,M,NY,2012,sales,59500 nancy,F,CT,2014,engineer,110000 greg,M,NJ,2012,manager,74000 cindy,F,NJ,2012,engineer,81000 janet,F,NJ,2015,clerk,30150 joe,M,NY,2014,dentist,74000 luke,M,CT,2015,dentist,78000 zoe,F,NY,2013,manager,89500 eli,M,CT,2011,sales,66000 anna,F,CT,2012,sales,59500 evan,M,NY,2014,clerk,2920 '
Percentile Facet Examples
Now we can use Solr’s analytics / facet functions to slice and dice our data!
Let’s say we want the 25%, 50%, and 75% percentile salaries across all our jobs:
$ curl http://localhost:8983/solr/demo/query -d 'q=*:*&json.facet={salary_percentiles:"percentile(salary_d,25,50,75)"}'
And at the end of our response, we’ll get our facet results:
[...] "facets" : { "count" : 15, "salary_percentiles" : [51000.0, 74000.0, 79500.0] } }
We can add in other statistics such as the average salary, the number of different jobs, and the number of different states in our salary survey:
$ curl http://localhost:8983/solr/demo/query -d 'q=*:*& json.facet={ average_salary : "avg(salary_d)", num_jobs : "unique(job_s)", num_states : "unique(loc_s)", salary_percentiles : "percentile(salary_d,25,50,75)" }'
"facets":{ "count":15, "average_salary":63374.6, "num_jobs":5, "num_states":3, "salary_percentiles":[51000.0,74000.0,79500.0] }
Now let’s take a look at median salary broken out by gender:
$ curl http://localhost:8983/solr/demo/query -d 'q=*:*& json.facet={ by_gender:{ type:terms field:gender_s, facet:{ median_salary:"percentile(salary_d,50)" } } }'
"facets":{ "count":15, "by_gender":{ "buckets":[ { "val":"M", "count":8, "median_salary":62750.0 }, { "val":"F", "count":7, "median_salary":81000.0 } ] } }
We can also sort by a percentile statistic. If you request more than one percentile value, the sort will be on the first value in the list requested. Let’s find the top states by 99.9th percentile salary:
$ curl http://localhost:8983/solr/demo/query -d 'q=*:*& json.facet={ rich_states:{ type : terms, field : loc_s, sort : {sal:desc}, // specifying the sort as a string, like sort:"sal desc" will also work facet : { sal : "percentile(salary_d,99.9)" } } }'
"facets":{ "count":15, "rich_states":{ "buckets":[{ "val":"CT", "count":5, "sal":109909.19600000001}, { "val":"NY", "count":5, "sal":89438.00000000001}, { "val":"NJ", "count":5, "sal":80976.0}]}}
We can get even more interesting by nesting facets. How about finding the highest earning occupation (99.9th percentile) for every state?
$ curl http://localhost:8983/solr/demo/query -d 'q=*:*& json.facet={ states:{ type:terms, field:loc_s, facet:{ top_jobs:{ // nested terms facet type : terms, field : job_s, sort : "sal desc", // sort will be on first percentile (99.9) limit : 1, // only show top occupation facet:{ sal : "percentile(salary_d,99.9,50,10)" } } } // end facet block for the loc_s field } }'
The response has been omitted since we don’t have enough data for it to be interesting.
We can also show how median salary has changed over time for each individual state:
$ curl http://localhost:8983/solr/demo/query -d 'q=*:*& json.facet={ states:{ type:terms, field:loc_s, facet:{ over_time:{ // nested range facet type : range, field : year_i, start : 2011, end : 2015, gap : 1, facet:{ median_salary : "percentile(salary_d,50)" } } } // end facet block for the loc_s field } }'
"facets":{ "count":15, "states":{ "buckets":[{ "val":"CT", "count":5, "over_time":{ "buckets":[{ "val":2011, "count":1, "median_salary":66000.0}, { "val":2012, "count":1, "median_salary":59500.0}, { "val":2013, "count":0}, { "val":2014, "count":1, "median_salary":110000.0}]}}, { "val":"NJ", "count":5, "over_time":{ "buckets":[{ "val":2011, "count":1, "median_salary":21250.0}, { "val":2012, "count":2, "median_salary":77500.0}, { "val":2013, "count":1, "median_salary":75000.0}, { "val":2014, "count":0}]}}, { "val":"NY", "count":5, "over_time":{ "buckets":[{ "val":2011, "count":1, "median_salary":42500.0}, { "val":2012, "count":1, "median_salary":59500.0}, { "val":2013, "count":1, "median_salary":89500.0}, { "val":2014, "count":2, "median_salary":38460.0}]}}]}}