Here’s an overview of some of the new features in Solr 7.0.
Download Solr 7 to try these features out and give us feedback!
Point Numeric Fields
The now deprecated trie-based numeric fields use (and abuse) the full-text index to index parts of numbers to speed up range queries. The new Points-based numeric fields do not use the full-text index, but instead have a dedicated index structure designed specifically for multidimensional ranges over numbers: the BKD tree. The BKD tree index structure is smaller and faster for range queries, but is slightly slower for exact match queries.
Since Point fields do not currently support un-inversion (i.e. FieldCache), some search functionality requires that points have docValues enabled for fast per-document lookups. This includes sorting, function queries, {!graph}, and {!join} queries.
Current template schemas define types like the following:
<fieldType name="pint" class="solr.IntPointField" docValues="true"/>
<fieldType name="pints" class="solr.IntPointField" docValues="true" multiValued="true"/>
Along with corresponding dynamic field types:
<dynamicField name="*_i" type="pint" indexed="true" stored="true"/>
<dynamicField name="*_is" type="pints" indexed="true" stored="true"/>
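To give a rough feel for how a tree over points answers range queries, here is a minimal sketch of a plain in-memory 2-D KD tree in Python. It is not Lucene's BKD implementation, which is block-based and built for on-disk indexes; the `Node`, `build`, and `range_search` names and the sample points are illustrative only.

```python
from typing import List, Optional, Tuple

Point = Tuple[int, int]


class Node:
    """One KD-tree node: a point plus the axis it splits on."""

    def __init__(self, point: Point, axis: int,
                 left: "Optional[Node]", right: "Optional[Node]") -> None:
        self.point = point
        self.axis = axis
        self.left = left
        self.right = right


def build(points: List[Point], depth: int = 0) -> Optional[Node]:
    """Build a tree by splitting on the median, alternating x and y."""
    if not points:
        return None
    axis = depth % 2
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return Node(ordered[mid], axis,
                build(ordered[:mid], depth + 1),
                build(ordered[mid + 1:], depth + 1))


def range_search(node: Optional[Node], lo: Point, hi: Point,
                 out: List[Point]) -> List[Point]:
    """Collect points p with lo[i] <= p[i] <= hi[i] on both axes."""
    if node is None:
        return out
    p, axis = node.point, node.axis
    if lo[0] <= p[0] <= hi[0] and lo[1] <= p[1] <= hi[1]:
        out.append(p)
    # Only descend into a side that can still overlap the query range.
    if lo[axis] <= p[axis]:
        range_search(node.left, lo, hi, out)
    if p[axis] <= hi[axis]:
        range_search(node.right, lo, hi, out)
    return out


points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build(points)
hits = sorted(range_search(tree, (4, 1), (8, 5), []))
print(hits)
```

The pruning checks in `range_search` are what make range queries cheap: whole subtrees that cannot intersect the query box are never visited.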
Interested in how the underlying BKD data structure works? Here’s a great lesson from Robert Sedgewick, Princeton University on KD trees (BKD trees are just a variant of KD trees):
Distributed Facet Refinement
The JSON Facet API has a new parameter called refine that turns on two-phase refinement of partial facets during distributed faceting. This guarantees that the statistics (counts or other metrics) within returned facet buckets are accurate.
Partial facets are those facets that specify a limit and thus may not return all facet buckets from all shards in a distributed search. If refine is set to true on one of these partial facets, a second phase is used to “refine” the top buckets from the first phase, collecting information from other shards that did not yet contribute to those buckets. Without refinement, counts and statistics for the bucket can be incorrect. The second phase of faceting normally does not cause any additional HTTP requests since they are piggy-backed onto the normal second phase of distributed search that retrieves stored fields for the top document ids.
Example:
json.facet={ x : { type : terms, field : cat, limit : 5, refine : true } }
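The effect of refinement can be illustrated with a toy model in Python. This is a simplified sketch, not Solr's actual implementation: each shard is just a dictionary of term counts, and the `top_n` and `facet` helpers and the sample data are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def top_n(counts: Dict[str, int], n: int) -> Dict[str, int]:
    """Return the n highest-count terms (ties broken by term name)."""
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:n])


def facet(shards: List[Dict[str, int]], limit: int, refine: bool) -> Dict[str, int]:
    # Phase 1: every shard returns only its own top `limit` buckets.
    partials = [top_n(shard, limit) for shard in shards]
    merged: Counter = Counter()
    for partial in partials:
        merged.update(partial)
    top_terms = list(top_n(merged, limit))

    # Phase 2 (optional): ask shards that did not report a top bucket
    # for their count of that bucket, and add it in.
    if refine:
        for term in top_terms:
            for shard, partial in zip(shards, partials):
                if term not in partial:
                    merged[term] += shard.get(term, 0)
    return {term: merged[term] for term in top_terms}


shard1 = {"a": 10, "b": 9, "c": 8}
shard2 = {"c": 10, "d": 9, "a": 8}

unrefined = facet([shard1, shard2], limit=2, refine=False)
refined = facet([shard1, shard2], limit=2, refine=True)
print(unrefined)
print(refined)
```

In this model the unrefined counts for "a" and "c" are too low, because each term fell outside the top two on one shard; the refined counts match the true totals across both shards.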
min/max Aggregations support Strings
The min/max facet aggregations (facet functions) have been extended beyond numeric fields and functions to include single-valued string fields.
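As a rough illustration, a request using these aggregations on a string field might be assembled like this in Python. The field name `title_s` and the bucket labels are hypothetical; only the `min(...)`/`max(...)` aggregation syntax and the `json.facet` parameter come from the JSON Facet API.

```python
import json

# Aggregations over a single-valued string field (field name is assumed).
facet_request = {
    "first_title": "min(title_s)",
    "last_title": "max(title_s)",
}

# Request parameters for a /select call; rows=0 returns facets only.
params = {
    "q": "*:*",
    "rows": 0,
    "json.facet": json.dumps(facet_request),
}
print(params["json.facet"])
```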
New Streaming Evaluators
Streaming Expressions has many new functions:
- movingAvg
- arraySort
- cumulative
- anova
- hist
- array
- sequence
- finddelay
- knn
- describe
- copyOfRange
- sql
- copyOf
- distance
- scale
- rank
- length
- reverse
Replica Types
SolrCloud now has support for different replica types.
NRT
NRT stands for Near Real Time. This is the default and original replica type in SolrCloud. Updates flow from the leader to all replicas and are added to replica transaction logs as well as indexed. This is the only type of replica to support soft commits since TLOG and PULL replicas need hard commits to copy over new index segments.
TLOG
Updates flow from the leader to all replicas and are added to replica transaction logs (tlogs) only. Replicas are kept up-to-date by pulling new index segment files from the leader.
The transaction logs allow these replicas to recover and become a leader if necessary, as well as directly service real-time get requests.
PULL
Updates do not flow from the leader to replicas. Replicas are kept up-to-date by pulling new index segment files from the leader. This is similar to the original non-SolrCloud master-slave replication.
Replicas of this type may not become leaders since updates would most likely be lost.
Real-time get requests to a PULL replica are forwarded to the leader since these replicas lack transaction logs to find the most recent uncommitted updates.
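Replica types are chosen when a collection is created. The sketch below builds a Collections API CREATE URL in Python using the `nrtReplicas`, `tlogReplicas`, and `pullReplicas` parameters; the host, collection name, and the `create_collection_url` helper are assumptions for illustration, and the URL is only constructed, not sent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def create_collection_url(base_url: str, name: str, num_shards: int,
                          nrt: int = 0, tlog: int = 0, pull: int = 0) -> str:
    """Build a Collections API CREATE URL with per-type replica counts."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "nrtReplicas": nrt,
        "tlogReplicas": tlog,
        "pullReplicas": pull,
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"


# Two TLOG replicas (leader-eligible) plus one PULL replica per shard.
url = create_collection_url("http://localhost:8983/solr", "products",
                            num_shards=2, tlog=2, pull=1)
print(url)
```

A mix like this keeps leader-eligible replicas on TLOG while PULL replicas only serve queries.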
Other Changes
- The default response format has been changed from XML to indented JSON. Add wt=xml to the request to obtain an XML response, and add indent=off if you wish to turn off indenting.
- The new v2 API, exposed at /api/, is now the preferred API (especially for using the Collections API), but /solr/ continues to be supported.
- The alternate “Analytics Component” in the contrib modules was bumped to v2, with distributed support and a new JSON-based request syntax.
- A new auto-scaling framework allows Solr to place new replicas based on metrics such as free disk space.
- The standard lucene/solr query parser now defaults to sow=false, meaning that for text fields, it does not split on whitespace before handing the text to the analyzer. This enables multi-word synonyms to be matched by the analyzer.
- When a collection is created without specifying a configset, the new ‘_default’ configset is now used. It is data-driven (schemaless), and indexes strings as analyzed text in addition to using a copyField to copy them to a *_str field suitable for sorting or faceting.
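The request parameters mentioned in the list above can be set per query. Here is a small Python sketch that builds such query URLs; the host, the collection name `products`, the field `text`, and the `select_url` helper are hypothetical, and the URLs are only constructed, not sent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def select_url(base_url: str, collection: str, **params: str) -> str:
    """Build a /select query URL for a collection."""
    return f"{base_url}/{collection}/select?{urlencode(params)}"


base = "http://localhost:8983/solr"

# Ask for the pre-7.0 style response: XML, without indentation.
xml_url = select_url(base, "products", q="*:*", wt="xml", indent="off")

# Opt back into splitting on whitespace for a single query.
sow_url = select_url(base, "products", q="wi fi", df="text", sow="true")

print(xml_url)
print(sow_url)
```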
See CHANGES for more detailed upgrade notes.
Also see the official release notes on the Solr wiki.
The Solr Reference Guide should contain other upgrade info.