Data Replication keeps copies of the same data on independent nodes. Copies let the system survive node loss, serve reads from the nearest or least-loaded replica, and scale read throughput. The hard part is not making copies — it is deciding what a reader sees while copies are briefly out of step.
Strategies sit on two axes: where writes are accepted (single-leader, multi-leader, leaderless) and when copies converge (synchronous or asynchronous). Those choices set the consistency–availability–latency tradeoff.
How It Works
- A write lands on a replica and propagates to the others.
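- Synchronous replication holds the write until the required replicas (all of them, or a quorum) acknowledge; asynchronous replication acknowledges as soon as the accepting replica has applied the write locally, then propagates in the background.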
- A consistency model — strong, read-your-writes, or eventual — defines which writes a later read must reflect.
- A quorum or consensus protocol (e.g. Raft, Paxos) coordinates write ordering and leader election.
- On node loss, surviving replicas keep serving and a new leader is elected if needed.
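The quorum rule behind the steps above can be sketched in a few lines. This is a minimal in-memory illustration, not a real replication protocol: `Replica`, `QuorumStore`, and the caller-supplied `version` are hypothetical names, and a production system would assign versions itself and propagate each write to the remaining replicas in the background. With N replicas, a write quorum W, and a read quorum R, choosing R + W > N forces every read set to overlap the latest write set.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One in-memory copy mapping key -> (version, value)."""
    up: bool = True
    store: dict = field(default_factory=dict)


class QuorumStore:
    """N replicas with write quorum W and read quorum R (hypothetical sketch)."""

    def __init__(self, n: int, w: int, r: int):
        self.replicas = [Replica() for _ in range(n)]
        self.w, self.r = w, r

    def _live(self):
        return [rep for rep in self.replicas if rep.up]

    def write(self, key, value, version: int) -> bool:
        live = self._live()
        if len(live) < self.w:
            return False  # quorum unreachable: reject rather than under-replicate
        # Only the first W live replicas apply the write here; a real system
        # would also propagate to the rest in the background.
        for rep in live[: self.w]:
            rep.store[key] = (version, value)
        return True

    def read(self, key):
        live = self._live()
        if len(live) < self.r:
            raise RuntimeError("read quorum unavailable")
        # Ask the last R live replicas, the worst case for overlapping the
        # write set, and return the value carrying the highest version.
        answers = [rep.store.get(key, (0, None)) for rep in live[-self.r:]]
        return max(answers, key=lambda a: a[0])[1]


strict = QuorumStore(n=3, w=2, r=2)   # R + W > N: read and write sets overlap
strict.write("k", "v1", version=1)
fresh = strict.read("k")              # "v1"

loose = QuorumStore(n=3, w=1, r=1)    # R + W <= N: the sets can miss each other
loose.write("k", "v1", version=1)
stale = loose.read("k")               # None: the replica asked never saw the write
```

Marking a replica as down (`strict.replicas[0].up = False`) shows the availability side of the same dial: the store keeps accepting writes while at least W replicas remain, and rejects them once fewer do.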
Failure Modes
- Stale reads: an async replica lags, so a reader sees data older than the last committed write.
- Split-brain: a partition lets two sides accept conflicting writes, producing divergent histories to reconcile.
- Lost write: an async leader acknowledges a write, then crashes before propagating it; the follower promoted in its place never received the write, so a write the client saw as committed is dropped.
- Lag cascade: a slow replica falls progressively behind under write load, widening the window of acknowledged writes that would be lost if the leader failed.
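The stale-read and lost-write cases can be reproduced with a toy asynchronous leader. The sketch below is an assumption-laden illustration: `AsyncLeader`, its `outbox`, and the `ship` throttle are hypothetical names standing in for a real replication stream, and there is a single follower represented as a plain list.

```python
from collections import deque


class AsyncLeader:
    """Leader that acknowledges before replicating (hypothetical sketch)."""

    def __init__(self):
        self.log = []          # every write acknowledged to a client
        self.outbox = deque()  # acknowledged writes not yet sent to the follower

    def write(self, entry) -> str:
        self.log.append(entry)
        self.outbox.append(entry)
        return "ack"  # returned before the follower has the entry

    def ship(self, follower_log: list, max_entries: int) -> None:
        # Background propagation, throttled to model a slow link or replica.
        for _ in range(min(max_entries, len(self.outbox))):
            follower_log.append(self.outbox.popleft())


leader, follower_log = AsyncLeader(), []
for entry in ["a", "b", "c"]:
    leader.write(entry)
leader.ship(follower_log, max_entries=2)

lag = len(leader.log) - len(follower_log)  # 1 entry behind
stale_read = follower_log[-1]              # "b", while the leader's latest is "c"

# If the leader crashes now and the follower is promoted, whatever was still
# in the outbox is gone, even though clients received "ack" for it.
lost_writes = leader.log[len(follower_log):]  # ["c"]
```

If writes arrive faster than `ship` drains the outbox, `lag` grows with every round, which is the lag-cascade case: the set of `lost_writes` at crash time grows with it.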
Verification
- Measure replication lag (p99) and alert when it exceeds the recovery-point objective (RPO).
- Kill the leader under load; assert reads and writes resume within the failover budget with no committed-write loss.
- Run a consistency checker (e.g. a linearizability test) against the declared model.
- Partition the cluster and confirm the configured behavior — reject writes, or accept and reconcile.
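The lag check in the first item could look like the sketch below. It assumes lag samples are already collected in seconds, uses a nearest-rank percentile, and treats the RPO as a plain threshold; `percentile` and `lag_alert` are hypothetical helper names, not part of any monitoring tool.

```python
def percentile(samples, pct: int):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of pct/100 * n, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def lag_alert(lag_seconds, rpo_seconds: float, pct: int = 99):
    """Return (should_alert, observed_lag) for the chosen percentile."""
    observed = percentile(lag_seconds, pct)
    return observed > rpo_seconds, observed


# 100 samples with a single outlier: p99 stays under a 5-second RPO.
quiet = [0.1] * 98 + [0.4, 7.0]
# 100 samples with two outliers: p99 now exceeds the RPO.
noisy = [0.1] * 97 + [0.4, 6.0, 7.0]

quiet_result = lag_alert(quiet, rpo_seconds=5.0)  # (False, 0.4)
noisy_result = lag_alert(noisy, rpo_seconds=5.0)  # (True, 6.0)
```

Alerting on p99 rather than the maximum tolerates an isolated spike, but it also means the single worst sample (7.0 seconds in `quiet`) is ignored, so a team with a hard RPO may prefer to alert on the maximum as well.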
Variants and Related Tactics
- Single-leader, multi-leader, or leaderless — where writes are accepted.
- Synchronous versus asynchronous — the durability-against-latency dial.
- Standby/Failover — replication is how a hot or warm standby stays current.
- Database Sharding — partitions data for scale; replication copies each partition for availability.
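How sharding and replication compose can be shown with a small placement sketch: a key maps to one shard, and each shard maps to several nodes. The functions `shard_for` and `replicas_for`, the node names, and the "next nodes in the list" placement rule are all assumptions made for illustration; real systems use placement schemes such as consistent hashing or rack-aware allocation.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard with a hash that is stable across processes."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def replicas_for(shard: int, nodes: list[str], replication_factor: int) -> list[str]:
    """Place a shard on `replication_factor` consecutive nodes, wrapping around."""
    if replication_factor > len(nodes):
        raise ValueError("replication factor exceeds node count")
    return [nodes[(shard + i) % len(nodes)] for i in range(replication_factor)]


nodes = ["n0", "n1", "n2", "n3"]
placement = {s: replicas_for(s, nodes, replication_factor=3) for s in range(4)}
# placement[1] == ["n1", "n2", "n3"]; placement[3] == ["n3", "n0", "n1"]

shard = shard_for("user:42", num_shards=4)
owners = placement[shard]  # the three nodes holding copies of this key
```

Sharding decides which partition owns a key; replication decides how many nodes hold that partition, so losing one node here still leaves two copies of every shard it hosted.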