Replica Sets - Oplog

Replication of data between nodes is done using a special collection known as the oplog.

See also : Replication Oplog Length.

The basic replication process

  1. All write operations are sent to the server (Insert, Update, Remove, DB/Collection/Index creation/deletion, etc.)
  2. That operation is written to the database.
    • That operation is also written to the oplog.
  3. Replicas (slaves) listen to the oplog for changes (known as “tailing the oplog”).
  4. Each secondary copies the (idempotent) operation to their own oplog and applies the operations to their data.
  5. This read + apply step is repeated

The oplog collection

  • The oplog is a special collection type known as a capped collection. The oplog is a collection of fixed size containing information about the operation and a timestamp for that operation. The timestamps are in UTC, regardless of the host's default time zone.
  • Because the oplog has a fixed size, it over-writes old data to make room for new data. At any given time, the oplog only contains a finite history of operations.

Falling Behind

  • Each secondary keeps track of which oplog items have been copied, and applied locally. This allows the secondary to have a copy of the primary's oplog, which is consistent across the replicaset.
  • If a secondary falls behind for a short period of time, it will make a best effort to “catch-up”.

Example:

  • A secondary needs 5 minutes of downtime to be rebooted.
  • When this computer comes online, the mongod process will compare its oplog to that of the Master.
  • mongod will identify that it is 5 minutes behind.
  • mongod will begin processing the primary’s oplog sequentially until it is “caught up”.

This is the ideal situation. The oplog has a finite length, so it can only contain a limited amount of history.

Becoming Stale

  • If a secondary falls too far behind the primary’s oplog that node will become stale.
  • A stale member will stop replication since it can no longer catch up through the oplog.

Example:

  • an oplog contains 20 hours of data
  • a secondary is offline for 21 hours
  • that secondary will become stale, it will stop replicating

If you have a stale replica, see the documents for resyncing.

Preventing a Stale Replica

  • The oplog should be large enough to allow for unplanned downtime, replication lag (due to network or machine load issues), and planned maintanance.
  • The size of the oplog is configured at startup using the --oplogSize command-line parameter. This value is used when you initialize the set, which is the time when the oplog is created (1.7.2+). If you change the --oplogSize parameter later, it has no effect on your existing oplog.
  • There is no easy formula for deciding on an oplog size. The size of each write operation in the oplog is not fixed.
    • Running your system for a while is the best way to estimate the space required. The size and amount of time in your oplog is related to the types and frequency of your writes/updates.
  • Recovering a stale replica is similar to adding a new replica.
A completely re-sync can often take a long time, especially with large datasets. To be able to bring up a new replica “from scratch” ensure that you have a large enough oplog to cover the time to re-sync.

See Also

Follow @mongodb

MongoDB Pittsburgh - May 15
MongoNYC - May 23
MongoDB Paris - Jun 14
MongoDB UK - Jun 20
MongoDC - June 26


Enter labels to add to this page:
Please wait 
Looking for a label? Just start typing.

PLEASE POST QUESTIONS IN THE USER GROUPS FORUM. Post non-question comments and helpful hints here.

blog comments powered by Disqus