Replication of data between nodes is done using a special collection known as the oplog.
See also : Replication Oplog Length.
The basic replication process
- All write operations are sent to the server (Insert, Update, Remove, DB/Collection/Index creation/deletion, etc.)
- That operation is written to the database.
- That operation is also written to the oplog.
- Replicas (slaves) listen to the oplog for changes (known as “tailing the oplog”).
- Each secondary copies the (idempotent) operation to their own oplog and applies the operations to their data.
- This read + apply step is repeated
The oplog collection
- The oplog is a special collection type known as a capped collection. The oplog is a collection of fixed size containing information about the operation and a timestamp for that operation. The timestamps are in UTC, regardless of the host's default time zone.
- Because the oplog has a fixed size, it over-writes old data to make room for new data. At any given time, the oplog only contains a finite history of operations.
Falling Behind
- Each secondary keeps track of which oplog items have been copied, and applied locally. This allows the secondary to have a copy of the primary's oplog, which is consistent across the replicaset.
- If a secondary falls behind for a short period of time, it will make a best effort to “catch-up”.
Example:
- A secondary needs 5 minutes of downtime to be rebooted.
- When this computer comes online, the mongod process will compare its oplog to that of the Master.
- mongod will identify that it is 5 minutes behind.
- mongod will begin processing the primary’s oplog sequentially until it is “caught up”.
This is the ideal situation. The oplog has a finite length, so it can only contain a limited amount of history.
Becoming Stale
- If a secondary falls too far behind the primary’s oplog that node will become stale.
- A stale member will stop replication since it can no longer catch up through the oplog.
Example:
- an oplog contains 20 hours of data
- a secondary is offline for 21 hours
- that secondary will become stale, it will stop replicating
If you have a stale replica, see the documents for resyncing.
Preventing a Stale Replica
- The oplog should be large enough to allow for unplanned downtime, replication lag (due to network or machine load issues), and planned maintanance.
- The size of the oplog is configured at startup using the --oplogSize command-line parameter. This value is used when you initialize the set, which is the time when the oplog is created (1.7.2+). If you change the --oplogSize parameter later, it has no effect on your existing oplog.
- There is no easy formula for deciding on an oplog size. The size of each write operation in the oplog is not fixed.
- Running your system for a while is the best way to estimate the space required. The size and amount of time in your oplog is related to the types and frequency of your writes/updates.
- Recovering a stale replica is similar to adding a new replica.
 | A completely re-sync can often take a long time, especially with large datasets. To be able to bring up a new replica “from scratch” ensure that you have a large enough oplog to cover the time to re-sync. |
See Also
PLEASE POST QUESTIONS IN THE USER GROUPS FORUM. Post non-question comments and helpful hints here.
blog comments powered by Disqus