Count Distinct in Solr


A 100% accurate count of distinct values (count distinct) is not generally possible without actually observing all of the values together. However there are a number of ways to estimate the count.

“unique” Facet Function

The unique facet function is Solr’s fastest implementation to calculate the number of distinct values.
It always provides exact counts on a single Solr node. For distributed search over multiple nodes, it provides exact counts when the number of values per node does not exceed 100 (by default).

When the number of unique values does exceed 100 in any given shard, the following algorithm is used:

  • It estimates the count by sending the top 100 results from each shard along with the total exact “unique” count for each shard.
  • totalSeen is the number of actual results we saw from all shards (i.e. not deduped yet).
  • uniqueSeen is the number of unique values we saw from all shards (i.e. deduped).
  • notSeen is the number of unique values from each shard that were not sent (because of the 100 cutoff).
  • factor = uniqueSeen / totalSeen (i.e. what fraction of values that we saw were unique)
  • estimate = uniqueSeen + ( notSeen * factor ) (i.e. we simply apply the factor to the number of values we didn’t see)

Example use:

$ curl http://localhost:8983/solr/techproducts/query -d '
q=*:*&
json.facet={
  x : "unique(manu_exact)"    // manu_exact is the manufacturer indexed as a single string
}'

For more facet functions, adding facet functions to each facet bucket, or sorting by facet function, see Solr Facet Functions

HyperLogLog

(New in Solr 5.2)
The HyperLogLog algorithm was developed as an advanced statistical method of estimating the distinct number of values without using too much memory. It does add more calculation overhead however, and is thus slower than Solr’s “unique” facet function.

HyperLogLog requires very high quality hashes for accurate estimation. Solr uses a port of MurmurHash3 for Java to calculate these hashes.

The Facet Analytics Module and the older Stats component both have support for HyperLogLog as of Solr 5.2

JSON Facet API HyperLogLog Example

A new “hll” facet function was added as an alternative to the existing faster (but less accurate for high cardinality) “unique” function.

To get the unique number of manufacturers using the HyperLogLog algorithm:

$ curl http://localhost:8983/solr/techproducts/query -d '
q=*:*&
json.facet={
  x : "hll(manu_exact)"    // manu_exact is the manufacturer indexed as a single string
}'

If we want the number of unique manufacturers per bucket of a facet:

$ curl http://localhost:8983/solr/techproducts/query -d '
q=*:*&
json.facet={
  categories: {
    type : terms,
    field : cat,
    facet : {
      x : "hll(manu_exact)"      
    }
  }
}'

And we get a response containing:

  "facets":{
    "count":32,
    "categories":{
      "buckets":[{
          "val":"electronics",
          "count":12,
          "x":9},
        {
          "val":"currency",
          "count":4,
          "x":4},
    [...]

Stats Component HyperLogLog Example

To get the unique number of manufacturers per facet bucket using the Stats component:

$ curl http://localhost:8983/solr/techproducts/query -d '
q=*:*&
stats=true&
facet=true&
stats.field={!tag=stat1 cardinality=true}manu_exact&
facet.pivot={!stats=stat1}cat'

And we get a response containing:

  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{},
    "facet_dates":{},
    "facet_ranges":{},
    "facet_intervals":{},
    "facet_heatmaps":{},
    "facet_pivot":{
      "cat":[{
          "field":"cat",
          "value":"electronics",
          "count":12,
          "stats":{
            "stats_fields":{
              "manu_exact":{
                "cardinality":9}}}},
        {
          "field":"cat",
          "value":"currency",
          "count":4,
          "stats":{
            "stats_fields":{
              "manu_exact":{
                "cardinality":4}}}},
   [...]

Cardinality Performance

Here is a performance comparison of the different implementations in various scenarios.

Test configuration:

Index:
  documents: 5M
  index segments: 25
  index size: 1.74GB
  6 single valued string fields with 10, 100, 1000, 10000, 100000, 1000000 unique values respectively.
  6 single valued integer fields as above.
  6 multi-valued string fields with 1-5 values per field, with 10, 100, 1000, 10000, 100000, 1000000 unique values respectively.
  6 multi-valued integer fields as above.
  5% chance of any given field having no values for a particular document.

Queries:
  Base test query and filters (the domain) matches 2,161,827 documents.
  Single client thread (and both requests only use a single internal thread per request).
  Single warm-up run per implementation that is discarded.
  Multiple runs across all fields, with fastest time being taken for each field.

Legacy Faceting command (stats component + pivot faceting):

facet=true&stats=true&stats.field={!tag=stat1+cardinality=true}s10_s&facet.pivot={!stats=stat1}m100_5_ss&f.m100_5_ss.facet.limit=10

JSON Faceting command (new Facet Module):

json.facet={
  f:{
    type : terms,
    field : m100_5_ss,
    facet : { stat1:"hll(s10_s)" }
  }
}

This first test facets on the multi-valued string field m100_5_ss (it has up to 5 values per field, and 100 unique values in total). Then for the top 10 buckets, the cardinality of the single-valued string field (the “stat field”) is calculated. The chart below shows performance for different number of unique values in the stat field.

This next test reverses the 2 fields above, first faceting on the single valued string field s100_s (it has 100 unique values in total) and then calculating cardinality over multi-valued string fields with different numbers of unique values in the index.

The last test facets on the integer field s100_i and then calculates cardinality over another integer field with varying number of unique terms in the index.