Design ConceptsCheck out the Replica Set Design Concepts for some of the core concepts underlying MongoDB Replica Sets. ConfigurationCommand LineWe specify --replSet set_name/seed_hostname_list on the command line. seed_hostname_list is a (partial) list of some members of the set. The system then fetches full configuration information from the collection local.system.replset. set_name is specified to help the system catch misconfigurations. In current versions of MongoDB (1.8+) seed_hostname_list is not required; --replSet set_name will suffice. Node TypesConceptually, we have some different types of nodes:
Each node in the set has a priority setting. On a resync (see below), the rule is: choose as master the node with highest priority that is healthy. If multiple nodes have the same priority, pick the node with the freshest data. For example, we might use 1.0 priority for Normal members, 0.0 for passive (0 indicates cannot be primary no matter what), and 0.5 for a server in a less desirable data center. local.system.replsetThis collection has one document storing the replica set's configuration. See the configuration page for details. Set Initiation (Initial Setup)For a new cluster, on negotiation the max OpOrdinal is zero everywhere. We then know we have a new replica set with no data yet. A special command {replSetInitiate:1}
is sent to a (single) server to begin things. DesignServer States
Applying OperationsSecondaries apply operations from the Primary. Each applied operation is also written to the secondary's local oplog. We need only apply from the current primary (and be prepared to switch if that changes). OpOrdinalWe use a monotonically increasing ordinal to represent each operation. These values appear in the oplog (local.oplog.$main). maxLocalOpOrdinal() returns the largest value logged. This value represents how up-to-date we are. The first operation is logged with ordinal 1. Note two servers in the set could in theory generate different operations with the same ordinal under some race conditions. Thus for full uniqueness we must look at the combination of server id and op ordinal. Picking PrimaryWe use a consensus protocol to pick a primary. Exact details will be spared here but that basic process is:
Any server in the replica set, when it fails to reach master, attempts a new election process. Heartbeat MonitoringAll nodes monitor all other nodes in the set via heartbeats. If the current primary cannot see half of the nodes in the set (including itself), it will fall back to secondary mode. This monitoring is a way to check for network partitions. Otherwise in a network partition, a server might think it is still primary when it is not. Heartbeats requests are sent out every couple of seconds and can either receive a response, get an error, or time out (after ~20 seconds). Assumption of PrimaryWhen a server becomes primary, we assume it has the latest data. Any data newer than the new primary's will be discarded. Any discarded data is backed up to a flat file as raw BSON, to allow for the possibility of manual recovery (see this case for some details). In general, manual recovery will not be needed - if data must be guaranteed to be committed it should be written to a majority of the nodes in the set. FailoverWe renegotiate when the primary is unavailable, see Picking Primary. Resync (Connecting to a New Primary)When a secondary connects to a new primary, it must resynchronize its position. It is possible the secondary has operations that were never committed at the primary. In this case, we roll those operations back. Additionally we may have new operations from a previous primary that never replicated elsewhere. The method is basically:
We can work our way back in time until we find a few operations that are consistent with the new primary, and then stop. Any data that is removed during the rollback is stored offline (see Assumption of Primary, so one can manually recover it. It can't be done automatically because there may be conflicts. Reminder: you can use w= to ensure writes make it to a majority of slaves before returning to the user, to ensure no writes need to be rolled back. ConsensusFancier methods would converge faster but the current method is a good baseline. Typically only ~2 nodes will be jockeying for primary status at any given time so there isn't be much contention:
Increasing DurabilityWe can trade off durability versus availability in a replica set. When a primary fails, a secondary will assume primary status with whatever data it has. Thus, we have some desire to see that things replicate quickly. Durability is guaranteed once a majority of servers in the replica set have an operation. To improve durability clients can call getlasterror and wait for acknowledgement until replication of a an operation has occurred. The client can then selectively call for a blocking, somewhat more synchronous operation. Reading from Secondaries and StalenessSecondaries can report via a command how far behind the primary they are. Then, a read-only client can decide if the server's data is too stale or close enough for usage. Exampleserver-a: secondary oplog: () In the above example, server-c becomes primary after server-a fails. Operations (a4,a5) are lost. c4 and c5 are new operations with the same ordinals. AdministrationSee the Replica Set Commands page for full info. Commands:
Future Versions
See Also |

PLEASE POST QUESTIONS IN THE USER GROUPS FORUM. Post non-question comments and helpful hints here.
blog comments powered by Disqus