Solr Terms Query for matching many terms


Solr 4.10 and Heliosearch 0.07 have added a terms query (or terms filter) to match many terms in a single field more efficiently.
Matching a large number of terms is often useful for things like access control lists or security filters. Previously, the only way to do this was a large boolean query with many clauses, which has unnecessary overhead when scoring is not needed.

Solr’s implementation uses Lucene’s TermsFilter class, as does Elasticsearch’s terms filter.

The Heliosearch terms query implementation has some additional features:

  • prefix compression including off-heap construction
  • direct creation of off-heap filter for faster execution and less garbage production
  • native code bit-setting
  • ability to skip sorting the terms if desired

Query Syntax

Boolean Query Syntax

For reference, specifying a filter query (fq) in the normal Lucene syntax via a boolean query looks like the following (assuming a default boolean operator of OR):

fq=id:doc334 id:doc125 id:doc777 id:doc321 id:doc253

or in a more compact form, like:

fq=id:(doc334 doc125 doc777 doc321 doc253)

Be aware that a boolean query with more than 1024 clauses will cause an exception in Solr by default (the limit is controlled by the maxBooleanClauses setting). Heliosearch has no such limit.
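As a rough illustration of the compact boolean form and the clause limit, here is a minimal Python sketch. The helper name build_boolean_fq and the client-side check are hypothetical, not part of Solr. It assumes the values need no query-syntax escaping and that the server uses the default limit of 1024.

```python
MAX_BOOLEAN_CLAUSES = 1024  # assumed: Solr's default limit, may differ per server


def build_boolean_fq(field, values, max_clauses=MAX_BOOLEAN_CLAUSES):
    """Build the compact boolean filter form, e.g. id:(doc334 doc125).

    Hypothetical helper: fails early on the client instead of letting
    the server reject a query with too many clauses.
    """
    values = list(values)
    if len(values) > max_clauses:
        raise ValueError(
            "%d clauses exceeds the limit of %d" % (len(values), max_clauses)
        )
    # Terms are separated by spaces; the default operator is assumed to be OR.
    return "%s:(%s)" % (field, " ".join(values))


if __name__ == "__main__":
    print(build_boolean_fq("id", ["doc334", "doc125", "doc777"]))
```

Checking the count on the client is only a convenience. The limit that actually applies is whatever the server is configured with.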

Terms Query Syntax

The corresponding new terms query in both Solr and Heliosearch is:

fq={!terms f=id}doc334,doc125,doc777,doc321,doc253
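When the list of terms comes from application code (for example, a user's access control list), the filter string has to be assembled and URL-encoded before it is sent. The following Python sketch shows one way to do that. The helper names build_terms_fq and build_select_params are hypothetical, and the sketch assumes that no term contains a comma, since the comma is the separator in the syntax shown above.

```python
from urllib.parse import urlencode


def build_terms_fq(field, values):
    """Build a terms filter such as {!terms f=id}doc334,doc125 (hypothetical helper)."""
    values = list(values)
    for value in values:
        # Assumption: terms are comma-separated, so a comma inside a term
        # would be read as a separator.
        if "," in value:
            raise ValueError("term contains a comma: %r" % value)
    return "{!terms f=%s}%s" % (field, ",".join(values))


def build_select_params(field, values, q="*:*"):
    """Return a URL-encoded query string carrying q and the terms filter."""
    return urlencode({"q": q, "fq": build_terms_fq(field, values)})


if __name__ == "__main__":
    ids = ["doc334", "doc125", "doc777", "doc321", "doc253"]
    print(build_terms_fq("id", ids))
    print(build_select_params("id", ids))
```

The encoded string can be appended to a select request URL. The braces, exclamation mark, and commas are percent-encoded by urlencode, so the local-params prefix survives transport intact.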

Terms Query Performance

Performance of terms queries is shown relative to using a Boolean query in Solr.
For example, the last column in the first chart represents a 10-term filter that matches 10,000,000 documents (1 million per term). The request execution time is:

  • 381,342 microseconds with a Solr Boolean Query
  • 122,119 microseconds with a Solr Terms Query
  • 67,075 microseconds with a Heliosearch Terms Query
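To put those three numbers in relative terms, the short Python sketch below computes the speedup of each approach over the Solr Boolean Query. It uses only the timings listed above; the variable names are illustrative.

```python
# Request times in microseconds, taken from the list above.
timings_us = {
    "Solr Boolean Query": 381_342,
    "Solr Terms Query": 122_119,
    "Heliosearch Terms Query": 67_075,
}

baseline = timings_us["Solr Boolean Query"]

# Speedup = baseline time / measured time (higher is faster).
speedups = {name: baseline / t for name, t in timings_us.items()}

if __name__ == "__main__":
    for name, speedup in speedups.items():
        print("%s: %.2fx" % (name, speedup))
```

For this one data point, the Solr terms query is roughly 3.1 times faster than the boolean query, and the Heliosearch terms query is roughly 5.7 times faster.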

Benchmark details:

  • 10M document index
  • 64 bit Java 1.8.0_20 Oracle JDK
  • Windows 8 64 bit, quad-core Intel i5-3570K @ 3.4GHz
  • Request time was measured externally and includes the entire request time, including the time for the client to send the request and read the response.
  • Solr versions: Apache Solr 4.10.0, Heliosearch 0.07 (based on Solr 4.10)

 
The first set of tests consists of 10-term queries that match varying numbers of documents:
[Chart: terms_perf_10]

 
The next set of tests deals with 100-term queries that match varying numbers of documents:
[Chart: terms_perf_100]

 
Finally, the last test deals with terms queries on the id field (i.e. each term matches a single document):
[Chart: terms_perf_ids]

Memory Consumption and Garbage Production

The first performance tests were run multiple times and the amount of garbage produced was recorded.
[Chart: terms_perf_memory]

The Heliosearch off-heap optimizations clearly pay dividends here, resulting in much less heap usage, less garbage production (which will mean less garbage collection work), and a smaller process size.