Solr 4.8 Features


Solr 4.8 has been released. Here’s an overview of how to use some of the new features.
Also see Solr download links and upcoming features of the next Solr release.

Complex Phrase Queries

The complexphrase query parser can produce phrase queries with embedded wildcards and boolean queries.
It works via multiple passes, parsing a query and then re-parsing any phrase queries for additional markup. At query execution time, span queries are generated to implement the complex phrase logic.

The simplest example is a phrase query containing a prefix query:

q={!complexphrase}"apple ip*"

This will match text containing either “apple ipod” or “apple ipad”.
One can specify inOrder=false as a localParam to also match “ipod apple” and “ipad apple”.

q={!complexphrase inOrder=false}"apple ip*"

One can also specify a different default field to search with the df localParam:

q={!complexphrase df=name}"john* smith"

This will match both “john smith” and “johnathan smith” in the name field. Of course one could always specify the field directly in the query as well:

q={!complexphrase}name:"john* smith"

Phrase slop works to specify the proximity of the clauses. For example, the following would also match a name of “johnathan q smith”:

q={!complexphrase}name:"john* smith"~1

And of course we can throw in parens, OR clauses, and other complex logic as well:

q={!complexphrase}name:"(aaa OR (bbb* OR ccc)) ddd -eee (fff~1 OR ggg)" AND text:"nnn? (ooo OR ppp) -qqq www"~3
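
When these queries are sent from code rather than typed into a browser, the braces, quotes, and wildcards need URL encoding. Here is a minimal Python sketch of that step. It assumes the example server endpoint used elsewhere in this post, and the helper name complexphrase_url is made up for illustration:

```python
from urllib.parse import parse_qs, urlencode, urlsplit

SOLR_QUERY_URL = "http://localhost:8983/solr/query"  # assumed example endpoint


def complexphrase_url(phrase_query, df=None, in_order=None):
    """Build a request URL for the complexphrase parser (hypothetical helper).

    phrase_query is the raw query text, e.g. '"apple ip*"'.
    df and in_order map to the df and inOrder localParams.
    """
    local_params = ["!complexphrase"]
    if in_order is not None:
        local_params.append("inOrder=" + ("true" if in_order else "false"))
    if df is not None:
        local_params.append("df=" + df)
    q = "{" + " ".join(local_params) + "}" + phrase_query
    return SOLR_QUERY_URL + "?" + urlencode({"q": q})


url = complexphrase_url('"apple ip*"', in_order=False)
print(url)

# Decoding the URL again recovers the original q parameter.
print(parse_qs(urlsplit(url).query)["q"][0])
```

The second print shows the q value exactly as Solr would receive it, which is handy when checking that nothing was double-encoded.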

 

Indexing Child Documents in JSON

Previously, one had to use XML or binary format (or SolrJ) to index nested child documents (needed for block join functionality). Support has now been added for JSON:

curl 'http://localhost:8983/solr/update/json?softCommit=true' -H 'Content-type:application/json' -d '
[
  {
    "id": "chapter1",
    "title" : "Indexing Child Documents in JSON",
    "content_type": "chapter",
    "_childDocuments_": [
      {
        "id": "1-1",
        "content_type": "page",
        "text": "ho hum... this is page 1 of chapter 1"
      },
      {
        "id": "1-2",
        "content_type": "page",
        "text": "more text... this is page 2 of chapter 1"
      }
    ]
  }
]
'
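
If the documents are generated by a program, the same nested structure can be built as plain dictionaries and serialized to JSON. The following Python sketch only produces the payload and does not send it; the helper chapter_with_pages and its page-id scheme are assumptions modeled on the example above:

```python
import json


def chapter_with_pages(chapter_id, page_prefix, title, page_texts):
    """Return a parent (chapter) document with nested page documents.

    Hypothetical helper: page ids are built as "<page_prefix>-<n>".
    """
    pages = [
        {"id": f"{page_prefix}-{n}", "content_type": "page", "text": text}
        for n, text in enumerate(page_texts, start=1)
    ]
    return {
        "id": chapter_id,
        "title": title,
        "content_type": "chapter",
        "_childDocuments_": pages,  # the key Solr looks for in JSON updates
    }


docs = [
    chapter_with_pages(
        "chapter1",
        "1",
        "Indexing Child Documents in JSON",
        [
            "ho hum... this is page 1 of chapter 1",
            "more text... this is page 2 of chapter 1",
        ],
    )
]
payload = json.dumps(docs, indent=2)
print(payload)
```

The printed payload can be posted to the update handler in place of the inline JSON in the curl command.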

Block Join Example

Now if we query on “ho hum”, we obviously get page 1 of chapter 1 back:

http://localhost:8983/solr/query?q="ho hum"
[...]
 "response":{"numFound":1,"start":0,"docs":[
      {
        "id":"1-1",
        "content_type":["page"]}]
  }

But if we wanted to select chapters based on matches in pages, we could utilize a parent block join:

http://localhost:8983/solr/query?q={!parent which='content_type:chapter'}"ho hum"
[...]
  "response":{"numFound":1,"start":0,"docs":[
      {
        "id":"chapter1",
        "content_type":["chapter"]}]
  }

A child block join can be used to restrict (or match) child pages based on matches in a chapter (parent). For example, the following request returns all pages for which the chapter title contains “Indexing”:

http://localhost:8983/solr/query?q={!child of=content_type:chapter}title:Indexing
[...]
"response":{"numFound":2,"start":0,"docs":[
      {
        "id":"1-1",
        "content_type":["page"]},
      {
        "id":"1-2",
        "content_type":["page"]}]
  }

The query above would probably be more useful as a filter. For example, if we wanted to search for “hum” on all pages where the chapter had “Indexing” in the title:

http://localhost:8983/solr/query?q=hum&fq={!child of=content_type:chapter}title:Indexing
[...]
 "response":{"numFound":1,"start":0,"docs":[
      {
        "id":"1-1",
        "content_type":["page"]}]
  }
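
The two block join parsers differ only in their name and in the localParam that carries the parent filter (which for parent, of for child). A small Python sketch that builds the query strings used above; the function names are made up, and the parent filter is single-quoted in both cases, which Solr's localParams syntax allows:

```python
from urllib.parse import urlencode

SOLR_QUERY_URL = "http://localhost:8983/solr/query"  # assumed example endpoint
PARENT_FILTER = "content_type:chapter"  # filter that identifies parent documents


def to_parent(child_query, parent_filter=PARENT_FILTER):
    """Select parents whose children match child_query."""
    return "{!parent which='%s'}%s" % (parent_filter, child_query)


def to_children(parent_query, parent_filter=PARENT_FILTER):
    """Select children whose parent matches parent_query."""
    return "{!child of='%s'}%s" % (parent_filter, parent_query)


# Chapters containing a page that matches "ho hum".
chapters = {"q": to_parent('"ho hum"')}

# Pages matching "hum", restricted to chapters with "Indexing" in the title.
pages = {"q": "hum", "fq": to_children("title:Indexing")}

for params in (chapters, pages):
    print(SOLR_QUERY_URL + "?" + urlencode(params))
```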

Solr 5.3 and later can combine faceting with nested objects / block join.

 

Expand Component

The ExpandComponent can be used to expand parent/child relationships in Solr. Joel previously blogged about the Expand Component and gave an example of how it could be used to expand a block join.

 

Named Config Sets

This is more in the “configuration” category of features. SolrCloud has always allowed multiple collections to share configuration, and now that capability has been brought to Solr’s non-cloud mode.

Since collections can be created or destroyed, we obviously don’t want shared configuration for these collections to be under the collection itself. The default location for config sets is the “configsets” directory under the solr home (the example Solr server currently doesn’t have this directory by default).

Let’s create a configSet named “generic” and then create two new collections (single core) called “books” and “music”:

~/solr/example$ mkdir -p solr/configsets/generic/conf/
~/solr/example$ cp -r solr/collection1/conf/* solr/configsets/generic/conf/
~/solr/example$ curl 'http://localhost:8983/solr/admin/cores?action=CREATE&name=books&configSet=generic'
~/solr/example$ curl 'http://localhost:8983/solr/admin/cores?action=CREATE&name=music&configSet=generic'

Now you should be able to open the admin console at http://localhost:8983/solr and use the “Core Selector” at the bottom left to see the new cores/collections we just created.

Let’s inspect what was done from the command line:

~/solr/example$ ls -F solr
README.txt   bin/         books/       collection1/ configsets/  music/       solr.xml     zoo.cfg
~/solr/example$ ls -F solr/books
core.properties  data/
~/solr/example$ cat solr/books/core.properties 
#Written by CorePropertiesLocator
#Thu Apr 24 21:12:33 EDT 2014
name=books
configSet=generic

So we can see that the new cores created only contain a data directory and lack a “conf” directory of their own. The core.properties file points to the correct named configSet.
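
To check this from a script rather than by eye, the core.properties file can be read as simple key=value pairs. The Python sketch below is a deliberately simplified reader, not a full Java properties parser (it ignores escapes and line continuations), and the function name is made up:

```python
def read_core_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


# Contents of solr/books/core.properties as shown above.
sample = """#Written by CorePropertiesLocator
#Thu Apr 24 21:12:33 EDT 2014
name=books
configSet=generic
"""

props = read_core_properties(sample)
print(props["name"], "uses config set", props["configSet"])
```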

 

Stopwords and Synonyms REST API

Stopwords and Synonyms may now be managed via a REST API!
The new analysis filter types are ManagedStopFilterFactory and ManagedSynonymFilterFactory.
The example schema.xml now contains a field type that uses these new analysis filters:

    <!-- A text type for English text where stopwords and synonyms are managed using the REST API -->
    <fieldType name="managed_en" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.ManagedStopFilterFactory" managed="english" />
        <filter class="solr.ManagedSynonymFilterFactory" managed="english" />
      </analyzer>
    </fieldType>

To test this out, let’s also change the dynamic field *_en to use managed_en:

   <dynamicField name="*_en"  type="managed_en"    indexed="true"  stored="true" multiValued="true"/>

Synonyms

After starting the example server, we can retrieve the current English synonyms:

curl "http://localhost:8983/solr/collection1/schema/analysis/synonyms/english"
[...]
    "managedMap":{
      "gb":["gib",
        "gigabyte"],
      "happy":["glad",
        "joyful"],
      "tv":["television"]}}}

 
Let’s add a new synonym:

curl -XPUT "http://localhost:8983/solr/collection1/schema/analysis/synonyms/english" -H 'Content-type:application/json' --data-binary '{"mb":["MiB","megabyte"]}'

 
Before these changes are visible to the actual search or indexing code in Solr, we need to reload the Solr core:

curl "http://localhost:8983/solr/admin/cores?action=RELOAD&core=collection1"

 
And now we can do a query on a field that matches the dynamicField we set up and can see the results of the new synonym:

curl "http://localhost:8983/solr/query?q=foo_en:mb&debugQuery=true"
[...]
  "debug":{
    "rawquerystring":"foo_en:mb",
    "querystring":"foo_en:mb",
    "parsedquery":"(foo_en:megabyte foo_en:mib)/no_coord",
    "parsedquery_toString":"foo_en:megabyte foo_en:mib",

 
To delete the synonym mapping we just added:

curl -XDELETE "http://localhost:8983/solr/collection1/schema/analysis/synonyms/english/mb"
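
The same add, reload, delete sequence can be driven from a program. The Python sketch below only constructs the HTTP requests with the standard library and prints them; nothing is sent until a request is passed to urllib.request.urlopen. The helper names are made up, and the URLs assume the example server and the collection1 core used above:

```python
import json
from urllib.parse import quote
from urllib.request import Request

BASE = "http://localhost:8983/solr"  # assumed example server


def synonyms_url(core, managed_name):
    return f"{BASE}/{core}/schema/analysis/synonyms/{managed_name}"


def add_synonyms_request(core, managed_name, mapping):
    """PUT a JSON map of term -> list of synonyms."""
    return Request(
        synonyms_url(core, managed_name),
        data=json.dumps(mapping).encode("utf-8"),
        headers={"Content-type": "application/json"},
        method="PUT",
    )


def delete_synonym_request(core, managed_name, term):
    """DELETE the mapping for a single term."""
    url = synonyms_url(core, managed_name) + "/" + quote(term, safe="")
    return Request(url, method="DELETE")


def reload_core_request(core):
    """Reload the core so that analysis changes take effect."""
    return Request(f"{BASE}/admin/cores?action=RELOAD&core={quote(core, safe='')}")


steps = [
    add_synonyms_request("collection1", "english", {"mb": ["MiB", "megabyte"]}),
    reload_core_request("collection1"),
    delete_synonym_request("collection1", "english", "mb"),
]
for req in steps:
    print(req.get_method(), req.full_url)
```

Keeping request construction separate from sending makes it easy to log or review the calls before they touch a live core.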

Stopwords

To retrieve the list of stopwords:

curl "http://localhost:8983/solr/collection1/schema/analysis/stopwords/english"

To add a new stopword:

curl -XPUT "http://localhost:8983/solr/collection1/schema/analysis/stopwords/english" -H 'Content-type:application/json' --data-binary '["foo"]'

To delete the stopword we just added:

curl -XDELETE "http://localhost:8983/solr/collection1/schema/analysis/stopwords/english/foo"

 

Other changes

There have been numerous SolrCloud changes, including:

  • New LIST and CLUSTERSTATUS actions in the Collections API, which clients can use to read collection and shard information instead of reading data directly from ZooKeeper.
  • Some long-running SolrCloud commands (like shard splitting) may now be run in “async” mode to avoid client timeouts.
  • A new ADDREPLICA command in the Collections API.
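
As a rough illustration of the SolrCloud items above, here is a Python sketch that builds Collections API URLs. The action and parameter names (LIST, CLUSTERSTATUS, ADDREPLICA, SPLITSHARD, async, REQUESTSTATUS, requestid) are given as I understand them from the Collections API documentation and should be checked against the reference guide for your version. The collection name, shard name, and request id are hypothetical:

```python
from urllib.parse import urlencode

COLLECTIONS_API = "http://localhost:8983/solr/admin/collections"  # assumed server


def collections_url(action, params=None):
    """Build a Collections API URL for the given action."""
    query = {"action": action}
    query.update(params or {})
    return COLLECTIONS_API + "?" + urlencode(query)


urls = [
    # Read cluster information without talking to ZooKeeper directly.
    collections_url("LIST"),
    collections_url("CLUSTERSTATUS"),
    # Add a replica to an existing shard (hypothetical names).
    collections_url("ADDREPLICA", {"collection": "books", "shard": "shard1"}),
    # Start a long-running command in async mode, then poll for its status.
    collections_url(
        "SPLITSHARD",
        {"collection": "books", "shard": "shard1", "async": "1000"},
    ),
    collections_url("REQUESTSTATUS", {"requestid": "1000"}),
]
for url in urls:
    print(url)
```

The parameters are passed as a dictionary because async is a reserved word in Python and cannot be used as a keyword argument.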

Other changes include:

  • Solr 4.8 now requires Java 7!
  • RegexReplaceProcessorFactory now supports pattern capture group substitution in the replacement string.
  • A DocExpirationUpdateProcessorFactory that can mark documents based on a TTL (time-to-live) and periodically delete expired documents.
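
For readers unfamiliar with capture group substitution, the idea is that parenthesized parts of the pattern can be referenced from the replacement string. The Python sketch below illustrates the concept only; it is not Solr configuration. Python writes group references as \1 and \2, whereas Java's regex replacement syntax writes them as $1 and $2. The pattern and input are made up:

```python
import re

# Hypothetical pattern: two word-like parts separated by a hyphen.
pattern = re.compile(r"^(\w+)-(\w+)$")

# \1 and \2 refer to the first and second capture groups,
# so this swaps the two parts and joins them with an underscore.
result = pattern.sub(r"\2_\1", "foo-bar")
print(result)  # bar_foo
```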