Here’s an overview of some of the new features in Solr 6.0.
Download Solr 6 to try these features out and give us feedback!
You can also check out upcoming features of the next Solr release.
Parallel SQL
Solr 6 can run parallel SQL queries across SolrCloud collections. The SQL engine is built on top of Solr’s Streaming API (Streaming Expressions), which provides support for parallel relational algebra and real-time MapReduce.
- SQL statements are compiled to Streaming Expressions for parallel execution across SolrCloud worker nodes.
- SolrCloud collections are abstracted as Relational Tables.
- Full support for Lucene/Solr query syntax in the WHERE clause.
- Many operations such as grouping/rollups can be automatically parallelized, utilizing the real-time MapReduce capabilities of the streaming expressions.
- Grouping/rollup operations can be pushed down and leverage the JSON Facet API for increased performance.
- Currently works in SolrCloud mode only… no standalone mode yet.
- SQL functionality is currently experimental and incomplete (for example, the underlying streaming join functionality is still in the process of being integrated).
Examples:
select category, count(*), sum(inventory), min(price), max(price), avg(outstanding) from collection1 where text='4k HDTV' group by category order by sum(inventory) asc limit 10
select id,category from collection1 where category = '(dvd OR bluray)' order by category desc limit 100
select fieldA, fieldB, count(*), sum(fieldC), avg(fieldY) from collection1 where fieldC = 'term1 term2' group by fieldA, fieldB having sum(fieldC) > 1000 order by sum(fieldC) asc limit 100
See the Parallel SQL documentation for more info.
/sql
There is a SQL Request Handler mapped to the /sql endpoint. Only SolrCloud collections can currently be searched with the SQL handler.
Example:
$ curl http://localhost:8983/solr/techproducts/sql -d "stmt=select id from techproducts"
{"result-set":{"docs":[ {"id":"EN7800GTX/2DHTV/256M"}, {"id":"100-435805"}, {"id":"UTF8TEST"}, {"id":"SOLR1000"}, {"id":"9885A004"} ]}}
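The same request can be issued from a program. The following is a minimal Python sketch using only the standard library; the helper names (`build_sql_request`, `parse_result_set`, `run_query`) are hypothetical, and the handling of a trailing `EOF` marker tuple is an assumption about the response format rather than something shown in the example above.

```python
import json
import urllib.parse
import urllib.request


def build_sql_request(base_url, collection, stmt):
    """Build a POST request for the collection's /sql handler."""
    url = f"{base_url}/{collection}/sql"
    # The statement is sent as the form-encoded "stmt" parameter, as with curl -d.
    body = urllib.parse.urlencode({"stmt": stmt}).encode("utf-8")
    return urllib.request.Request(url, data=body, method="POST")


def parse_result_set(payload):
    """Return the row dicts from a /sql JSON response body."""
    docs = json.loads(payload)["result-set"]["docs"]
    # Assumption: skip an end-of-stream marker tuple if one is present.
    return [d for d in docs if "EOF" not in d]


def run_query(base_url, collection, stmt):
    """Send the statement to Solr and return the parsed rows."""
    req = build_sql_request(base_url, collection, stmt)
    with urllib.request.urlopen(req) as resp:
        return parse_result_set(resp.read().decode("utf-8"))
```

Against a running SolrCloud node, `run_query("http://localhost:8983/solr", "techproducts", "select id from techproducts")` would return a list of dicts, one per row.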
JDBC driver
Solr has a new JDBC driver that may be used to access the new SQL functionality.
The Solr JDBC driver has been tested with DbVisualizer, Apache Zeppelin, and SQuirreL SQL so far.
The general form of the JDBC connection string is:
jdbc:solr://SOLR_ZK_CONNECTION_STRING?collection=COLLECTION_NAME
The JDBC driver is not yet documented in the Ref Guide, so see https://issues.apache.org/jira/browse/SOLR-8521 for more documentation in the meantime.
Streaming Expressions
Distributed Joins
A number of distributed join operations have been added to streaming expressions:
- innerJoin
- leftOuterJoin
- hashJoin
- outerHashJoin
Example:
innerJoin( search(collection1, q=*:*, fl="fieldA, fieldB, fieldC", ...), search(collection2, q=*:*, fl="fieldA, fieldD, fieldE", ...), on="fieldA=fieldA" )
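To make the join semantics concrete, here is a small in-memory Python sketch of a hash join over lists of dicts. It only illustrates the inner versus outer behavior suggested by the names hashJoin and outerHashJoin; it is not Solr's implementation, and the function name `hash_join` and its arguments are hypothetical.

```python
from collections import defaultdict


def hash_join(left, right, on, outer=False):
    """Join two lists of dict tuples on a shared field name.

    outer=False: emit only left tuples that have a match on the right.
    outer=True: also emit unmatched left tuples unchanged (left outer join).
    """
    # Build a hash index over the right-hand stream, keyed by the join field.
    index = defaultdict(list)
    for r in right:
        index[r[on]].append(r)

    out = []
    for l in left:
        matches = index.get(l[on], [])
        if matches:
            for r in matches:
                # Merge fields; left values win if a field name collides.
                out.append({**r, **l})
        elif outer:
            out.append(dict(l))
    return out
```

With `on="fieldA"`, a left tuple `{"fieldA": 1, "fieldB": "x"}` and a right tuple `{"fieldA": 1, "fieldD": "d1"}` combine into a single tuple carrying fieldA, fieldB, and fieldD.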
rollup Streaming Expression
The rollup streaming expression groups tuples by common field values and emits the rollup value along with other specified metrics.
Example:
rollup( search(collection1, qt="/export", q="*:*", fl="id,manu,price", sort="manu asc"), over="manu", count(*), max(price) )
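The grouping step can be pictured with a short Python sketch. It assumes the incoming tuples are already sorted on the grouping field (as the `sort="manu asc"` in the example arranges), so each group can be emitted as soon as the field value changes. The function name `rollup` and the output key names are illustrative choices, not Solr's API.

```python
from itertools import groupby
from operator import itemgetter


def rollup(tuples, over, max_field):
    """Group tuples sorted by `over`; emit count(*) and max(max_field) per group."""
    out = []
    # groupby only merges adjacent equal keys, hence the sorted-input requirement.
    for key, group in groupby(tuples, key=itemgetter(over)):
        rows = list(group)
        out.append({
            over: key,
            "count(*)": len(rows),
            f"max({max_field})": max(r[max_field] for r in rows),
        })
    return out
```

Because only one group is held in memory at a time, this style of rollup works on a stream of any length as long as the sort order holds.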
facet Streaming Expression
The facet streaming expression is much like the rollup expression, but it pushes down the computation to the leaves using the JSON Facet API.
Example:
facet( techproducts, q="*:*", buckets="manu", bucketSorts="count(*) desc", bucketSizeLimit=1000, count(*), sum(price), max(popularity) )
Streaming Expression Docs
Many other Streaming Expressions were added for the Solr 6 release.
At this point, they should all be documented in the Streaming Expressions section of the Solr Ref Guide.
Graph traversal query
A basic graph traversal query that follows edges from node to node, optionally filtering during traversal.
Example: Assume we have documents that represent people, and each document has a field called “parent_id” which lists the parents.
This example query matches “Philip J. Fry” and all of his ancestors:
fq={!graph from=parent_id to=id}id:"Philip J. Fry"
The main argument to the graph query defines the root set, in this case id:"Philip J. Fry". The graph query then iteratively follows the parent_id field to documents with corresponding id fields (i.e. for each iteration, the values for parent_id in the current set are matched to the id field of all documents). The basic graph query is equivalent to a repeated join query.
Graph query parameters:
- from – The field used in the current set of documents to match to the to field in the set of destination documents.
- to – The field used to find matches in the values obtained from the from fields of the starting set of documents.
- traversalFilter – A filter query that is applied on each iteration.
- returnRoot – Controls whether the root set of documents should be included. Defaults to “true”.
- returnOnlyLeaf – If true, only returns leaf documents. Defaults to “false”.
- maxDepth – The maximum number of iterations before graph traversal stops. Defaults to -1 (unlimited).
NOTE: This graph query only traverses edges in the same index (i.e. it does not traverse edges across different nodes/cores). Distributed graph traversal is being developed and will be in future versions of Solr 6.x.
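The iteration described above can be sketched in a few lines of Python over an in-memory list of documents. This is an illustration of the "repeated join" idea only, not Solr's implementation: the function name `graph_query` is hypothetical, the interpretation of `max_depth` as a number of hops is an assumption, and traversalFilter and returnOnlyLeaf are left out for brevity.

```python
def graph_query(docs, root_ids, from_field="parent_id", to_field="id",
                return_root=True, max_depth=-1):
    """Return the ids reached by repeatedly following from_field -> to_field."""
    by_id = {d["id"]: d for d in docs}
    visited = set(root_ids)
    frontier = set(root_ids)
    depth = 0
    while frontier and (max_depth < 0 or depth < max_depth):
        # Collect the from_field values of the current set of documents...
        values = set()
        for doc_id in frontier:
            v = by_id[doc_id].get(from_field, [])
            values.update(v if isinstance(v, list) else [v])
        # ...and match them against to_field of all documents.
        matched = {d["id"] for d in docs if d.get(to_field) in values}
        # Only newly discovered documents are expanded on the next iteration.
        frontier = matched - visited
        visited |= frontier
        depth += 1
    return visited if return_root else visited - set(root_ids)
```

Starting from the root id "Philip J. Fry", each pass adds one more generation of ancestors until no new documents are found or the depth limit is reached.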
BM25 scoring
Scoring now uses Okapi BM25 by default.
You can enable the old tf-idf vector space similarity by using ClassicSimilarity.
Here is an example of how to use classic tf-idf similarity just for the “text” fieldType in the Solr schema:
<fieldType name="text" class="solr.TextField"> <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/> <similarity class="solr.ClassicSimilarityFactory"/> </fieldType>
For BM25, one can tweak the scoring function on a per-fieldType basis. For example:
<fieldType name="text2" class="solr.TextField"> <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/> <similarity class="solr.BM25SimilarityFactory"> <float name="k1">1.2</float> <float name="b">0.75</float> </similarity> </fieldType>
For those new to full-text search terminology, “similarity” produces a score for how similar a document is to a full-text query. When determining this score, document statistics as well as corpus statistics are used. Some scoring factors include:
- The number of times a search term appears in the document field. More matches produce a higher score.
- The size of the document field. Longer fields produce a lower score, with the idea being that for a given number of term matches, shorter is better (a more specific match).
- The average length of the field across the entire corpus (BM25 considers this, classic tf-idf does not).
- How common the query terms are across the entire corpus. The idea is that rarer terms carry more information. For example, if I searched for “blue whale”, all else being equal, I’d probably want things about whales to score higher than things about blue.
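The factors above map directly onto the textbook BM25 formula for a single query term. The Python sketch below uses that textbook form with the k1 and b values from the example configuration; it is meant to show how the factors interact, not to reproduce Lucene's exact scores (the production implementation may differ in details such as how field lengths are stored). The function and parameter names are hypothetical.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, doc_freq, num_docs, k1=1.2, b=0.75):
    """Score one query term against one document field using textbook BM25.

    tf: occurrences of the term in the field
    doc_len / avg_doc_len: this field's length and the corpus average length
    doc_freq / num_docs: documents containing the term, and total documents
    """
    # Rarer terms (small doc_freq) get a larger inverse document frequency.
    idf = math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Fields longer than average are penalized; b controls how strongly.
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    # Term frequency saturates: each extra occurrence adds less than the last.
    return idf * tf * (k1 + 1) / (tf + length_norm)
```

A full query score is the sum of this value over the query terms. Setting b to 0 disables length normalization, and raising k1 lets repeated occurrences keep adding to the score for longer before saturating.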
Filters for Real-Time Get
Real-Time Get (normally handled at the /get URL) now handles filters (fq parameters) to restrict matching documents.
Example:
curl "http://localhost:8983/solr/demo/get?id=book1,book2,book3&fq=security:group1"
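For callers that assemble this URL in code, a minimal Python sketch follows. It assumes the filters are sent as `fq` parameters (Solr's usual filter-query parameter) and that several ids are passed as one comma-separated `id` value, as in the curl example; the helper name `realtime_get_url` is hypothetical.

```python
from urllib.parse import urlencode


def realtime_get_url(base_url, collection, ids, filters=()):
    """Build a Real-Time Get URL for the given ids and optional filter queries."""
    params = [("id", ",".join(ids))]
    # One fq parameter per filter query.
    params += [("fq", f) for f in filters]
    # Keep ',' and ':' unescaped so the URL reads like the curl example.
    query = urlencode(params, safe=",:")
    return f"{base_url}/{collection}/get?{query}"
```

Only documents that match both the id list and every filter would be returned, so a request for an id that fails the filter simply yields no document for that id.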
Cross Data Center Replication
An experimental version of CDCR (Cross Data Center Replication) has been added that supports an active-passive configuration.
Updates are asynchronously sent from the active cluster leader to the passive cluster leader. Since updates are asynchronously relayed from the leader’s transaction logs, temporary connectivity issues between data centers can be tolerated with no interruption in service for the primary DC.
Here is the in-progress CDCR documentation. The documentation link is likely to change once CDCR moves out of its “experimental” phase.