Solr 4.10 and Heliosearch 0.07 have added a terms query (or terms filter) to more efficiently match many terms in a single field.
Filters with a large number of terms are often useful for things like access control lists or security filters. Previously, the only way to build one was a large boolean query with many clauses, which carries unnecessary overhead when scoring is not needed.
Solr’s implementation uses Lucene’s TermsFilter class, as does Elasticsearch’s terms filter.
The Heliosearch terms query implementation has some additional features:
- prefix compression including off-heap construction
- direct creation of off-heap filter for faster execution and less garbage production
- native code bit-setting
- ability to skip sorting the terms if desired
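To give a feel for what prefix compression buys on a sorted term list, here is a minimal sketch of generic front coding in Python. It is only an illustration of the general technique, not the Heliosearch implementation (which works off-heap); the function names `front_encode` and `front_decode` are made up for this example.

```python
def front_encode(sorted_terms):
    """Store each term as (length of prefix shared with the previous term, remaining suffix)."""
    encoded = []
    prev = ""
    for term in sorted_terms:
        shared = 0
        limit = min(len(prev), len(term))
        while shared < limit and prev[shared] == term[shared]:
            shared += 1
        encoded.append((shared, term[shared:]))
        prev = term
    return encoded


def front_decode(encoded):
    """Rebuild the original terms from (shared prefix length, suffix) pairs."""
    terms = []
    prev = ""
    for shared, suffix in encoded:
        term = prev[:shared] + suffix
        terms.append(term)
        prev = term
    return terms


# Sorting first is what makes neighbouring terms share long prefixes.
terms = sorted(["doc334", "doc125", "doc777", "doc321", "doc253"])
encoded = front_encode(terms)
print(encoded)
```

Terms that share a long prefix with their predecessor shrink to a small integer plus a short suffix, which is why this kind of encoding pays off for long lists of similar IDs.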
Boolean Query Syntax
For reference, specifying a filter query (fq) in the normal Lucene syntax via a boolean query looks like the following (assumes a default boolean operator of OR):
fq=id:doc334 id:doc125 id:doc777 id:doc321 id:doc253
or in a more compact form, like:
fq=id:(doc334 doc125 doc777 doc321 doc253)
Be aware that exceeding the default limit of 1024 boolean clauses in Solr (the maxBooleanClauses setting) will cause an exception. Heliosearch has no such limit.
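A client that still builds the boolean form may want to check the clause count before sending the request. The sketch below assumes the default limit of 1024 and assumes the values are plain tokens that need no query-syntax escaping; `boolean_fq` is a hypothetical helper name, not part of any Solr client library.

```python
MAX_BOOLEAN_CLAUSES = 1024  # Solr's default; assumed not to have been raised in solrconfig.xml


def boolean_fq(field, values, limit=MAX_BOOLEAN_CLAUSES):
    """Build the compact boolean form, e.g. id:(doc334 doc125)."""
    values = list(values)
    if not values:
        raise ValueError("at least one term is required")
    if len(values) > limit:
        # Fail on the client rather than letting the server reject the request.
        raise ValueError("%d clauses exceeds the limit of %d" % (len(values), limit))
    # Assumption: values contain no whitespace or query-syntax characters.
    return "%s:(%s)" % (field, " ".join(values))


fq = boolean_fq("id", ["doc334", "doc125", "doc777", "doc321", "doc253"])
print(fq)
```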
Terms Query Syntax
The corresponding new terms query in both Solr and Heliosearch is:
fq={!terms f=id}doc334,doc125,doc777,doc321,doc253
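On the client side, the terms form is easy to assemble: the `{!terms f=...}` local-params prefix names the field, and the values follow as a comma-separated list. The sketch below is one possible way to build and URL-encode the parameter in Python; `terms_fq` is a hypothetical helper, and it assumes no value contains a comma, since the comma is the default separator.

```python
from urllib.parse import urlencode


def terms_fq(field, values):
    """Build a terms filter such as {!terms f=id}doc334,doc125."""
    values = list(values)
    if not values:
        raise ValueError("at least one term is required")
    for value in values:
        if "," in value:
            # A comma inside a value would be read as a term boundary.
            raise ValueError("term contains the separator: %r" % value)
    return "{!terms f=%s}%s" % (field, ",".join(values))


ids = ["doc334", "doc125", "doc777", "doc321", "doc253"]
fq = terms_fq("id", ids)
print(fq)

# URL-encode the parameters before appending them to a select request.
print(urlencode({"q": "*:*", "fq": fq}))
```

Unlike the boolean form, the number of values here is not subject to the boolean clause limit, so the helper does not need a clause count check.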
Terms Query Performance
The performance of Solr and Heliosearch terms queries is shown relative to using a Boolean query in Solr.
For example, the last column in the first chart represents a 10-term filter that matches 10,000,000 documents (1 million per term). The request execution time is:
- 381,342 microseconds with a Solr Boolean Query
- 122,119 microseconds with a Solr Terms Query
- 67,075 microseconds with a Heliosearch Terms Query
Test details:
- 10M document index
- 64-bit Java 1.8.0_20 Oracle JDK
- Windows 8 64-bit, quad-core Intel i5-3570K @ 3.4GHz
- Request time was measured externally and includes the entire request time, including the time for the client to send the request and read the response.
- Solr versions: Apache Solr 4.10.0, Heliosearch 0.07 (based on Solr 4.10)
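The three request times quoted above for the 10-term filter are easier to compare as ratios. This small Python calculation uses only those three numbers; the dictionary keys are just labels chosen for the example.

```python
# Request times in microseconds for the 10-term filter matching 10M documents.
timings_us = {
    "Solr Boolean Query": 381_342,
    "Solr Terms Query": 122_119,
    "Heliosearch Terms Query": 67_075,
}

baseline = timings_us["Solr Boolean Query"]
speedups = {name: baseline / t for name, t in timings_us.items()}

for name, t in timings_us.items():
    print("%s: %.1f ms, %.2fx vs. Solr Boolean Query" % (name, t / 1000, speedups[name]))
```

For this data point, the Solr terms query comes out roughly 3.1 times faster than the Boolean query, and the Heliosearch terms query roughly 5.7 times faster.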
Memory Consumption and Garbage Production
The Heliosearch off-heap optimizations clearly pay dividends here, resulting in much less heap usage, less garbage production (which will mean less garbage collection work), and a smaller process size.