Data Replication

Data replication keeps copies of the same data on independent nodes. Copies let the system survive node loss, serve reads from the nearest or least-loaded replica, and scale read throughput. The hard part is not making copies; it is deciding what a reader sees while copies are briefly out of step.

Strategies sit on two axes: where writes are accepted (single-leader, multi-leader, leaderless) and when copies converge (synchronous or asynchronous). Those choices set the consistency–availability–latency tradeoff.

How It Works

A write lands on a replica and propagates to the others.
Synchronous replication holds the write until the required replicas (all of them, or a quorum) acknowledge; asynchronous replication acknowledges the client immediately and propagates in the background.
A consistency model — strong, read-your-writes, or eventual — defines which writes a later read must reflect.
A quorum or consensus protocol (e.g. Raft, Paxos) coordinates write ordering and leader election.
On node loss, surviving replicas keep serving and a new leader is elected if needed.
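
The difference between the two propagation modes can be sketched in a few lines. This is a minimal in-memory illustration, not a real replication protocol: the `Replica` class, the function names, and the list-based backlog are all hypothetical, and the synchronous path does not roll back replicas that applied a write when the quorum is missed.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical in-memory replica; `up` simulates node health."""
    name: str
    up: bool = True
    data: dict = field(default_factory=dict)

    def apply(self, key, value) -> bool:
        """Store the write and acknowledge it, unless the node is down."""
        if not self.up:
            return False
        self.data[key] = value
        return True


def write_sync(replicas, key, value, quorum):
    """Synchronous mode: report success only if `quorum` replicas acknowledged.

    Simplification: replicas that applied the write are not rolled back
    when the quorum is missed.
    """
    acks = sum(r.apply(key, value) for r in replicas)
    return acks >= quorum


def write_async(leader, followers, key, value, backlog):
    """Asynchronous mode: acknowledge once the leader applied; queue the rest."""
    if not leader.apply(key, value):
        return False
    backlog.extend((f, key, value) for f in followers)
    return True


def drain(backlog):
    """Background propagation: apply queued writes to followers in order."""
    while backlog:
        follower, key, value = backlog.pop(0)
        follower.apply(key, value)


# Usage: three replicas, one of them down.
a, b, c = Replica("a"), Replica("b"), Replica("c", up=False)
assert write_sync([a, b, c], "x", 1, quorum=2)   # two of three acknowledge

backlog = []
assert write_async(a, [b], "y", 2, backlog)      # client is acknowledged here
assert "y" not in b.data                         # follower is stale until drained
drain(backlog)
assert b.data["y"] == 2
```

The window between `write_async` returning and `drain` running is where the stale reads and lost writes described under Failure Modes come from.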

Failure Modes

Stale reads: an async replica lags, so a reader sees data older than the last committed write.
Split-brain: a partition lets two sides accept conflicting writes, producing divergent histories to reconcile.
Lost write: an async leader acknowledges a write, then crashes before propagating it, so a write the client believes committed is lost.
Lag cascade: a slow replica falls progressively behind under write load, widening the data-loss window.
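
The lag cascade is simple arithmetic: whenever the replica applies writes more slowly than the leader accepts them, the backlog grows linearly, and so does the amount of data that exists only on the leader. The sketch below assumes steady rates; the function names and the sample numbers are illustrative only.

```python
def backlog_after(seconds, write_rate, apply_rate, initial_backlog=0.0):
    """Unapplied writes on a replica after `seconds` at steady rates (writes/s)."""
    growth = (write_rate - apply_rate) * seconds
    return max(0.0, initial_backlog + growth)


def data_loss_window(backlog, write_rate):
    """Seconds of recent writes that would be lost if the leader died now."""
    return backlog / write_rate if write_rate > 0 else 0.0


# Leader accepts 1000 writes/s, replica applies 900 writes/s, for one minute.
backlog = backlog_after(60, write_rate=1000, apply_rate=900)
print(backlog)                          # 6000.0 writes behind
print(data_loss_window(backlog, 1000))  # 6.0 seconds of writes at risk
```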

Verification

Measure replication lag (p99) and alert when it exceeds the recovery-point objective (RPO).
Kill the leader under load; assert reads and writes resume within the failover budget with no committed-write loss.
Run a consistency checker (e.g. a linearizability test) against the declared model.
Partition the cluster and confirm the configured behavior — reject writes, or accept and reconcile.
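
The first check above can be sketched as a small function over collected lag samples. It assumes lag is sampled in seconds and uses a nearest-rank percentile; the function names and the sample data are hypothetical, and a real deployment would normally read these values from its monitoring system.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numeric samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def lag_breaches_rpo(lag_seconds, rpo_seconds, pct=99):
    """True when the chosen lag percentile exceeds the recovery-point objective."""
    return percentile(lag_seconds, pct) > rpo_seconds


# 100 samples: mostly 0.1 s of lag, with two slow outliers.
samples = [0.1] * 98 + [5.0, 9.0]
print(percentile(samples, 99))                    # 5.0
print(lag_breaches_rpo(samples, rpo_seconds=2))   # True: alert
```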

Variants and Related Tactics

Single-leader, multi-leader, or leaderless: where writes are accepted.
Synchronous versus asynchronous: the durability-against-latency dial.
Standby/Failover: replication is how a hot or warm standby stays current.
Database Sharding: partitions data for scale; replication copies each partition for availability.

Supported Qualities

Availability

Reads and writes continue against a surviving replica when a node fails.

Durability

Copies on independent nodes survive the loss of any single node or disk.

Scalability

Read replicas spread read load across copies, raising read throughput beyond one node.

Trade-offs

Consistency
Replicas diverge between updates. Strong consistency coordinates every write across replicas — slower, and unavailable under partition; relaxing to eventual consistency returns fast but can serve stale reads. This is the CAP/PACELC tradeoff.
Latency
Synchronous replication holds each write until enough replicas acknowledge, so write latency tracks the slowest quorum member, and a geographically distant replica adds round-trip time to every committed write.
Cost
Every replica multiplies storage, and cross-region replication adds continuous network transfer; N copies cost roughly N times the storage plus the bandwidth to keep them current.
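
The latency trade-off can be made concrete: if acknowledgements arrive independently, a write commits when the quorum-th fastest one arrives, so the commit latency is that replica's latency. The sketch below assumes per-replica acknowledgement latencies are known; the function name and the numbers are illustrative.

```python
def quorum_commit_latency(ack_latencies_ms, quorum):
    """Latency at which the `quorum`-th fastest acknowledgement arrives."""
    if not 1 <= quorum <= len(ack_latencies_ms):
        raise ValueError("quorum must be between 1 and the replica count")
    return sorted(ack_latencies_ms)[quorum - 1]


# Local replica, same-region replica, distant replica.
acks = [1, 2, 80]
print(quorum_commit_latency(acks, quorum=2))  # 2: the distant replica is not waited for
print(quorum_commit_latency(acks, quorum=3))  # 80: every write pays the long round trip
```

With a majority quorum the distant replica stays off the critical path; requiring all replicas puts it on every write.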

Related Requirements

Replication and Quorum Reads/Writes

Quorum reads and writes across replicas are exactly the protocol this requirement specifies for tolerating node loss and partitions.
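
A common way to size quorums, assumed here rather than taken from the requirement itself, is to pick read and write quorum sizes R and W over N replicas so that R + W > N. Every read quorum then shares at least one replica with every write quorum. The function names below are hypothetical.

```python
def quorums_overlap(n, w, r):
    """True when every read quorum intersects every write quorum (R + W > N)."""
    return r + w > n


def tolerated_failures(n, w, r):
    """Replicas that can be down while reads and writes both still reach quorum."""
    return n - max(w, r)


print(quorums_overlap(3, w=2, r=2))     # True: a read always meets the latest write
print(tolerated_failures(3, w=2, r=2))  # 1
print(quorums_overlap(3, w=1, r=1))     # False: a read may miss the latest write
```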

Available 7x24 with 99% uptime

Surviving replicas keep the datastore reachable when a node fails, holding the uptime target.

Zone failure: no service interruption

A replica in another zone keeps serving when a whole zone goes down.

Related Approaches

N-Version Redundancy and Voting

Redundancy at different layers: replication duplicates stored data for durability and read scaling, while N-version duplicates computation with design diversity to catch wrong results. One protects state, the other protects the compute over it.

Standby/Failover

Failover provides a spare for the compute component; replication copies the state it needs. A stateful standby is only useful once data is replicated to it, so the two tactics are usually deployed together.
