Solr needs a flexible cross-datacenter architecture that can handle a variety of application needs as well as a variety of infrastructure resources.
Design Goals
- Accommodate 2 or more data centers
- Accommodate active/active uses
- Accommodate limited-bandwidth cross-datacenter connections
- Minimize coupling between peer clusters to increase reliability
- Support both full consistency and eventual consistency
Issues with Running SolrCloud Across Data Centers
Running a single SolrCloud cluster across two data centers can be done, but has multiple drawbacks:
- Same update is forwarded multiple times (once per replica) over the bandwidth limited cross-DC pipe.
- No opportunity to add extra compression or security on the cross-DC connection
- Requires a 3rd data center to contain a zookeeper node for tie-breaking
- Burst indexing limited by cross-DC bandwidth.
- Lack of a true backup cluster if the cluster gets into a bad state
- Extra latency for all indexing operations
- Search requests are not data center aware (extra latency + bandwidth)
- Normal recovery mechanism (full index copy) may not be viable across DC connections
Architecture Overview
Clusters will be configured to know about each other, most likely through keeping a cluster peer list in zookeeper. One essential piece of information will be the zookeeper quorum address for each cluster peer. Any node in one cluster can know the configuration of another cluster via a zookeeper client.
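As a rough illustration of this discovery step, the sketch below reads a hypothetical peer list stored as JSON at a znode such as /cdcr/clusterpeers; the znode path and JSON layout are placeholders rather than part of this design, and only the standard ZooKeeper client API is assumed.

    import java.nio.charset.StandardCharsets;
    import org.apache.zookeeper.ZooKeeper;

    public class ClusterPeerList {
        // Hypothetical znode holding the peer list; the path and JSON layout are
        // illustrative only. The design only requires that each peer's zookeeper
        // quorum address be discoverable from any node.
        static final String PEERS_PATH = "/cdcr/clusterpeers";

        public static String readPeerList(String localZkQuorum) throws Exception {
            ZooKeeper zk = new ZooKeeper(localZkQuorum, 15000, event -> {});
            try {
                byte[] data = zk.getData(PEERS_PATH, false, null);
                // e.g. {"dc2": {"zkHost": "zk1.dc2:2181,zk2.dc2:2181/solr"}}
                return new String(data, StandardCharsets.UTF_8);
            } finally {
                zk.close();
            }
        }
    }

A node would use the zkHost value from such a list to open a client against the peer cluster's quorum.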
Update flow will go from the shard leader in one cluster to the shard leader in the peer clusters. This can be bi-directional, with updates flowing in both directions. Updates can be either synchronous or asynchronous, with per-update granularity.
Solr transaction logs are currently removed when no longer needed. They will be kept around (potentially much longer) to act as the source of data to be sent to peer clusters. Recovery can also be bi-directional with each peer cluster sending the other cluster missed updates.
Architecture Features & Benefits
- Scalable – no required single points of aggregation / dissemination that could act as a bottleneck.
- Per-update choice of synchronous/asynchronous forwarding to peer clusters.
- Peer clusters may have different configuration, such as replication factor.
- Asynchronous updates allow for bursts of indexing throughput that would otherwise overload cross-DC pipes.
- “Push” operation for lowest latency async updates.
- Low overhead – re-uses Solr’s existing transaction logs for queuing.
- Leader-to-leader communication means update is only sent over cross-DC connection once.
Update Flow
- An update will be received by the shard leader and versioned
- Update will be sent from the leader to its replicas
- Concurrently, the update will be sent (synchronously or asynchronously) to the shard leader in other clusters (see the sketch after this list)
- The shard leader in the other cluster will receive the already versioned update (and not re-version it), and forward the update to its replicas
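A minimal sketch of the cross-cluster leg of this flow is shown below, using the standard SolrJ client. The "cdcr.forwarded" parameter name is an assumption used here only to indicate that the receiving leader should keep the existing _version_ rather than re-versioning the document; the actual mechanism is an implementation detail.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.request.UpdateRequest;
    import org.apache.solr.common.SolrInputDocument;

    // Sketch of the cross-cluster forwarding step: the local shard leader sends an
    // already-versioned update to the shard leader of a peer cluster, either
    // blocking for the ack (synchronous) or handing it to a background queue
    // (asynchronous), with the retained transaction log covering replay on failure.
    public class PeerForwarder {
        private final ExecutorService asyncPool = Executors.newSingleThreadExecutor();

        public void forward(String peerLeaderUrl, SolrInputDocument doc,
                            boolean synchronous) throws Exception {
            if (synchronous) {
                send(peerLeaderUrl, doc);                       // caller waits for the peer ack
            } else {
                asyncPool.submit(() -> { send(peerLeaderUrl, doc); return null; });
            }
        }

        private void send(String peerLeaderUrl, SolrInputDocument doc) throws Exception {
            try (SolrClient peer = new HttpSolrClient.Builder(peerLeaderUrl).build()) {
                UpdateRequest req = new UpdateRequest();
                req.add(doc);                                   // doc already carries _version_
                req.setParam("cdcr.forwarded", "true");         // assumed flag: do not re-version
                req.process(peer);
            }
        }
    }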
Solr Document Versioning
The shard leader versions a document and then forwards it to replicas. Update re-orders are handled by the receiver, which drops any update detected to be older than the latest version of that document in the index. This works because complete documents are always sent to replicas, even if the update started as a partial update on the leader.
Solr version numbers are derived from a timestamp (the high bits are milliseconds and the low bits are incremented for each tie in the same millisecond to guarantee a monotonically increasing unique version number for any given leader).
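The sketch below shows the general shape of such a clock. Solr's own implementation follows the same pattern, though the exact number of low bits reserved for the tie-breaking counter (20 here) should be treated as an assumption for illustration.

    // Timestamp-based version clock: high bits are wall-clock milliseconds, low
    // bits break ties within the same millisecond, so versions from a single
    // leader are unique and monotonically increasing.
    public final class VersionClock {
        private long lastVersion = 0;

        public synchronized long nextVersion() {
            long candidate = System.currentTimeMillis() << 20;  // assumed 20-bit counter space
            if (candidate <= lastVersion) {
                candidate = lastVersion + 1;                    // same millisecond (or clock went backwards)
            }
            lastVersion = candidate;
            return candidate;
        }
    }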
The Clock Skew Problem
If updates are accepted for the same document in two different clouds (implying two different leaders versioning the document), then having the correct last document “win” relies on clock synchronization between the two leaders. Updates to the same document at different data centers within the clock skew time risk being incorrectly ordered. For example, if one leader’s clock runs 100ms behind the other’s, an update it accepts up to 100ms after a conflicting update in the other data center can still receive a lower version and incorrectly lose.
The Partial Update Problem
Solr only has versions at the document level. The current partial update implementation (because of other constraints) reads the current stored fields of the document, makes the requested update, and indexes the new resulting document. This creates a problem with accepting Solr atomic updates / partial updates to the same document in both data centers.
Example:
- DC1: writes document A, version=time1
- DC2: receives document A (version=time1) update from DC1
- DC1: updates A.street_address (Solr reads version time1, writes version time2)
- DC2: updates A.phone_number (Solr reads version time1, writes version time3)
- DC1: receives document A (version=time3) from DC2, writes it
- DC2: receives document A (version=time2) from DC1, ignores it (older version)
Although both data centers became “consistent”, the partial update of street_address was completely lost in the process.
Solutions
Option 1:
Configure updates for full synchronization. All peer clusters must be available for any of them to be writable.
Option 2:
Use client versioning, where the update clients specify a user-level version field.
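As a sketch of what option 2 looks like from the indexing client’s side, the snippet below writes an application-supplied version into an illustrative field name (app_version_l); Solr’s existing DocBasedVersionConstraintsProcessorFactory is one way the clusters could then resolve conflicts on that field, though the exact wiring is left open here.

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.common.SolrInputDocument;

    // Option 2 sketch: the application supplies its own monotonically increasing
    // version, so conflict resolution no longer depends on leader clocks.
    // "app_version_l" is an illustrative field name, not part of the design.
    public class ClientVersionedUpdate {
        public static void index(SolrClient client, String id, long appVersion,
                                 String phoneNumber) throws Exception {
            SolrInputDocument doc = new SolrInputDocument();
            doc.setField("id", id);
            doc.setField("app_version_l", appVersion);  // assigned by the application
            doc.setField("phone_number", phoneNumber);
            client.add(doc);
        }
    }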
Option 3:
For a given document, consider one cluster the primary for the purposes of document changes/updates. See “Primary Cluster Routing”.
Primary Cluster Routing
To deal with potential update conflicts arising from updating the same document in different data centers, each document can have a primary cluster.
A routing enhancement can ensure that a document sent to the wrong cluster will be forwarded to the correct cluster.
Routing can take as input a request parameter, a document field, or the unique id field. The primary cluster could be determined by hash code (essentially random), or could be determined by a mapping specified in the cluster peer list. Changes to this mapping for fail-over would not happen automatically in Solr. If a data center becomes unreachable, the application/client layers have responsibility for deciding that a different cluster should become the primary for that set of documents.
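A minimal sketch of the hash-based variant is shown below; the cluster ids and the use of String.hashCode are illustrative, and a mapping-based scheme would simply replace the hash with a lookup against the cluster peer list.

    import java.util.List;

    // Hash-based primary-cluster routing: the routing key (unique id, a document
    // field, or a request parameter) is hashed onto the ordered list of peer
    // clusters from the peer list in zookeeper.
    public class PrimaryClusterRouter {
        private final List<String> peerClusterIds;  // e.g. ["dc1", "dc2"]

        public PrimaryClusterRouter(List<String> peerClusterIds) {
            this.peerClusterIds = peerClusterIds;
        }

        public String primaryClusterFor(String routingKey) {
            int bucket = Math.floorMod(routingKey.hashCode(), peerClusterIds.size());
            return peerClusterIds.get(bucket);
        }

        public boolean isLocalPrimary(String localClusterId, String routingKey) {
            // if false, the receiving cluster forwards the update to the primary cluster
            return localClusterId.equals(primaryClusterFor(routingKey));
        }
    }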
Primary cluster routing will be optional. Many applications will naturally not trigger the type of undesirable update behavior described, or will have the ability to work around update limitations.
Future Option: Improve Partial Updates
Implement true partial updates with vector clocks and/or finer grained versioning so that updates to different fields can be done conflict free if re-ordered. This would also lower the bandwidth costs of partial updates since the entire document would no longer be sent to all replicas and to other peer clusters.
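The toy sketch below illustrates the idea of per-field versions (not current Solr behavior): re-ordered updates to different fields merge cleanly because each field keeps its own version rather than the whole document sharing one.

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative sketch of finer-grained versioning: an older update to one
    // field no longer clobbers a newer update to a different field.
    public class FieldVersionedDoc {
        private final Map<String, Object> values = new HashMap<>();
        private final Map<String, Long> fieldVersions = new HashMap<>();

        /** Apply a single-field update, keeping it only if it is newer for that field. */
        public void apply(String field, Object value, long version) {
            Long current = fieldVersions.get(field);
            if (current == null || version > current) {
                values.put(field, value);
                fieldVersions.put(field, version);
            }
        }
    }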
Future Option: Update Aggregators
One could potentially further minimize cross-DC traffic by introducing traffic aggregator nodes (one per cluster) that all updates would flow through. This would likely only improve bandwidth utilization in low-update-rate environments. The improvements would come from fewer connections (and hence less connection overhead) and better compression (a block of many small updates would generally have a better compression ratio than the same updates compressed individually).
Future Option: Clusterstate proxy
Many zookeeper clients in a peer cluster could generate significant amounts of traffic between data centers. A designated listener could watch the remote cluster state and disseminate it to the other nodes in the local cluster, rather than having every node hit the remote ZK directly.
Also worth investigating is the use of a local zookeeper observer node that could service all local ZK reads for the remote ZK quorum.
Future Option: Multi-Cluster Operations
The first phase of this design only deals with updates. Collection-level operations such as adding a new shard, splitting a shard, and changing replication levels must be performed by the client on every cluster as applicable.
The collections API (and other higher level APIs) could be made peer-aware such that these operations would also be forwarded to peer clusters, as well as including a queuing mechanism for the cases when a peer cluster is unreachable.
Implementation Notes
Forwarding of updates from one cloud to another should be done via the standard SolrJ client. Any needed enhancements/modifications should be made to the SolrJ client so that they can also be used in other contexts.
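For example, a forwarding component could simply hold a standard CloudSolrClient built against the peer cluster’s zookeeper quorum (taken from the cluster peer list), as sketched below; the host names, chroot, and collection name are illustrative.

    import java.util.Arrays;
    import java.util.Optional;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    // Sketch: forward an update to a peer cluster through the standard SolrJ
    // CloudSolrClient, which routes it to the correct remote shard leader.
    public class PeerClusterClientExample {
        public static void main(String[] args) throws Exception {
            try (CloudSolrClient peer = new CloudSolrClient.Builder(
                    Arrays.asList("zk1.dc2:2181", "zk2.dc2:2181"), Optional.of("/solr")).build()) {
                peer.setDefaultCollection("collection1");
                SolrInputDocument doc = new SolrInputDocument();
                doc.setField("id", "A");
                peer.add(doc);
                peer.commit();
            }
        }
    }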