Standby/Failover keeps a redundant copy of a component ready so that when the active instance fails, a standby is promoted and service continues. It is the workhorse availability tactic behind clustered databases, redundant controllers, and multi-zone deployments.
The variants differ mainly in how warm the standby runs. A hot spare processes the same inputs in parallel and typically fails over in milliseconds; a warm spare takes periodic state updates and recovers in seconds; a cold spare stays off until needed and recovers in minutes. A warmer standby recovers faster but costs more to keep running.
How It Works
- A failure detector — heartbeat, health probe, or lease expiry — watches the active component.
- When the signal stays missing past a timeout or miss threshold, the detector declares failure and triggers failover.
- A standby is promoted to active; traffic, virtual IP, or leadership moves to it.
- State reaches the standby by continuous replication (hot), periodic checkpoint (warm), or fresh start (cold).
- The repaired component rejoins as the new standby.
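
The steps above can be sketched as a small controller. This is a minimal illustration, not a production design: the class `FailoverController`, its methods, and the 3-second `timeout` are all hypothetical names and values chosen for the example, and time is passed in explicitly so the logic stays deterministic.

```python
class FailoverController:
    """Heartbeat-based failure detector plus standby promotion (illustrative)."""

    def __init__(self, active, standby, timeout=3.0):
        self.active = active
        self.standby = standby          # None while no standby is available
        self.timeout = timeout          # assumed: seconds of silence tolerated
        self.last_seen = {active: 0.0, standby: 0.0}
        self.log = []                   # (time, failed node, promoted node)

    def heartbeat(self, name, now):
        """Record a heartbeat received from a node at time `now`."""
        self.last_seen[name] = now

    def check(self, now):
        """Promote the standby if the active node has been silent too long."""
        silent_for = now - self.last_seen[self.active]
        if silent_for > self.timeout and self.standby is not None:
            failed = self.active
            self.active, self.standby = self.standby, None
            # Restart the silence clock for the newly promoted node.
            self.last_seen[self.active] = now
            self.log.append((now, failed, self.active))
        return self.active

    def rejoin(self, name, now):
        """A repaired node comes back as the new standby."""
        if self.standby is None and name != self.active:
            self.standby = name
            self.last_seen[name] = now


ctl = FailoverController("node-a", "node-b", timeout=3.0)
ctl.heartbeat("node-a", 0.0)
ctl.check(2.0)              # still within the timeout, node-a stays active
ctl.check(3.5)              # node-a silent too long, node-b is promoted
ctl.rejoin("node-a", 4.0)   # repaired node-a becomes the standby
```

A real detector would usually also guard against a false positive caused by a slow network rather than a dead node, which is where the failure modes below come in.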
Failure Modes
- Split-brain: a partition leaves both nodes acting as active, accepting writes and diverging. Requiring a quorum for promotion, or fencing the old active with a token, prevents it.
- Stale standby: replication lags, so the promoted node serves old or inconsistent state.
- Failover flapping: an unstable primary triggers repeated promotions, amplifying disruption.
- Untested path: failover runs for the first time during a real outage and silently fails.
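
One way to picture the fencing-token defense against split-brain is a store that rejects writes carrying an older token than one it has already seen. The sketch below assumes a lease service that hands out strictly increasing tokens on each promotion; `LeaseService` and `FencedStore` are hypothetical names for illustration, not any particular product's API.

```python
class LeaseService:
    """Hands out a strictly increasing token on every promotion (illustrative)."""

    def __init__(self):
        self._token = 0

    def grant(self):
        self._token += 1
        return self._token


class FencedStore:
    """Accepts a write only if its token is at least the highest seen so far."""

    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token, key, value):
        if token < self.highest_token:
            return False                # stale active: fenced off
        self.highest_token = token
        self.data[key] = value
        return True


lease = LeaseService()
store = FencedStore()
old_token = lease.grant()   # held by the original active node
new_token = lease.grant()   # granted to the promoted standby

accepted_new = store.write(new_token, "balance", 100)
accepted_old = store.write(old_token, "balance", 999)   # rejected as stale
```

Note the limit of this scheme: the old active is only fenced once the store has seen the newer token, so writes it makes before the promoted node's first write still land.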
Verification
- Kill or partition the active component and measure time to full service on the standby against the recovery-time objective.
- After a partition, assert exactly one node accepts writes — no split-brain.
- Track replication lag and alert when it exceeds the data-loss budget (RPO).
- Run scheduled failover drills and record the success rate.
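
These checks lend themselves to automation. The following sketch scores recorded drill results against a recovery-time objective and a data-loss budget; the `DrillResult` fields, the function names, and the thresholds in the usage lines are assumptions made for the example, and how the measurements are collected is left out.

```python
from dataclasses import dataclass


@dataclass
class DrillResult:
    recovery_seconds: float          # time until the standby served fully
    writers_after_partition: int     # nodes accepting writes after the split
    replication_lag_seconds: float   # lag observed at the moment of failover


def evaluate(drill, rto_seconds, rpo_seconds):
    """Return the list of checks a drill failed; empty means it passed."""
    failures = []
    if drill.recovery_seconds > rto_seconds:
        failures.append("rto")
    if drill.writers_after_partition != 1:
        failures.append("split-brain")
    if drill.replication_lag_seconds > rpo_seconds:
        failures.append("rpo")
    return failures


def success_rate(drills, rto_seconds, rpo_seconds):
    """Fraction of drills that passed every check (0.0 if none were run)."""
    if not drills:
        return 0.0
    passed = sum(1 for d in drills if not evaluate(d, rto_seconds, rpo_seconds))
    return passed / len(drills)


drills = [
    DrillResult(12.0, 1, 0.5),   # passes
    DrillResult(45.0, 1, 0.5),   # too slow
    DrillResult(10.0, 2, 0.5),   # two writers
    DrillResult(10.0, 1, 9.0),   # too much lag
]
rate = success_rate(drills, rto_seconds=30.0, rpo_seconds=5.0)
```

Keeping the failed-check names in the result, rather than a bare pass or fail, makes it easier to see which objective a drill missed.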
Variants and Related Tactics
- Hot, warm, and cold spare — the standby-temperature spectrum above.
- State Resynchronization — reconcile active and standby state before a promoted node goes live.
- Voting / N-version redundancy — a different goal: mask wrong output for integrity, not survive a crash for availability.