Standby/Failover keeps a redundant copy of a component ready so that when the active instance fails, a standby is promoted and service continues. It is the workhorse availability tactic behind clustered databases, redundant controllers, and multi-zone deployments.

The variants differ only in how warm the standby runs. A hot spare processes the same inputs in parallel and fails over in milliseconds; a warm spare takes periodic state updates and recovers in seconds; a cold spare stays off until needed and recovers in minutes. Warmer means faster recovery at higher standing cost.

How It Works

  • A failure detector — heartbeat, health probe, or lease expiry — watches the active component.
  • After a timeout or a configured number of missed signals, it declares failure and triggers failover.
  • A standby is promoted to active; traffic, virtual IP, or leadership moves to it.
  • State reaches the standby by continuous replication (hot), periodic checkpoint (warm), or fresh start (cold).
  • The repaired component rejoins as the new standby.
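
The steps above can be sketched as a small controller. This is a minimal illustration under stated assumptions, not a production design: the names Node and FailoverController, the 3-second timeout, and the injected `now` clock are all hypothetical, and a real system would also move traffic and replicate state.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    role: str  # "active" or "standby"
    last_heartbeat: float = 0.0


@dataclass
class FailoverController:
    """Promotes the standby when the active node misses its heartbeat deadline."""
    active: Node
    standby: Node
    timeout: float = 3.0  # assumed value: seconds of silence before declaring failure
    events: list = field(default_factory=list)

    def heartbeat(self, node: Node, now: float) -> None:
        # Record the latest sign of life from a node.
        node.last_heartbeat = now

    def check(self, now: float) -> Node:
        """Run one detection pass and return the node that is active afterwards."""
        if now - self.active.last_heartbeat > self.timeout:
            failed, promoted = self.active, self.standby
            promoted.role = "active"
            failed.role = "standby"  # rejoins as the standby once repaired
            self.active, self.standby = promoted, failed
            # Give the promoted node a fresh deadline so the next pass
            # does not immediately declare it failed as well.
            promoted.last_heartbeat = now
            self.events.append((now, f"failover {failed.name} -> {promoted.name}"))
        return self.active


# Usage: node "a" sends one heartbeat at t=0 and then goes silent.
a = Node("a", "active")
b = Node("b", "standby")
ctl = FailoverController(a, b, timeout=3.0)
ctl.heartbeat(a, now=0.0)
still_active = ctl.check(now=2.0).name   # within the timeout
after_failure = ctl.check(now=4.0).name  # deadline missed, standby promoted
```

Passing the clock in as `now` keeps the detector deterministic, which makes the failover path easy to exercise in a test.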

Failure Modes

  • Split-brain: a partition leaves both nodes acting as active, accepting writes and diverging. A quorum or fencing token prevents it.
  • Stale standby: replication lags, so the promoted node serves old or inconsistent state.
  • Failover flapping: an unstable primary triggers repeated promotions, amplifying disruption.
  • Untested path: failover runs for the first time during a real outage and silently fails.
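
To make the split-brain defence concrete, here is a minimal sketch of fencing tokens. It assumes a single token issuer that hands out a strictly increasing number on each promotion and a store that checks the token on every write. The class names TokenIssuer and FencedStore are hypothetical.

```python
class TokenIssuer:
    """Hands out a strictly increasing token on every promotion."""

    def __init__(self) -> None:
        self._last = 0

    def promote(self) -> int:
        self._last += 1
        return self._last


class FencedStore:
    """Storage that rejects writes carrying a token older than the newest seen."""

    def __init__(self) -> None:
        self.highest_token = 0
        self.data = {}  # key -> value

    def write(self, token: int, key: str, value: str) -> bool:
        if token < self.highest_token:
            return False  # the writer was superseded by a later promotion
        self.highest_token = token
        self.data[key] = value
        return True


# Usage: the old primary is partitioned away but still believes it is active.
issuer = TokenIssuer()
store = FencedStore()
old_token = issuer.promote()   # original primary
new_token = issuer.promote()   # standby promoted during the partition

new_ok = store.write(new_token, "balance", "100")
old_ok = store.write(old_token, "balance", "999")  # stale primary tries to write
```

The stale write is refused at the store, so the two nodes cannot diverge even if both consider themselves active.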

Verification

  • Kill or partition the active component and measure time to full service on the standby against the recovery-time objective.
  • After a partition, assert exactly one node accepts writes — no split-brain.
  • Track replication lag and alert when it exceeds the data-loss budget (RPO).
  • Run scheduled failover drills and record the success rate.
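
The checks above reduce to a few small calculations. The sketch below is illustrative only: the Drill record, the function names, and the sample numbers are assumptions, and timestamps are plain seconds rather than values read from a real monitoring system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Drill:
    failed_at: float               # when the active component was killed
    restored_at: Optional[float]   # when the standby served traffic; None if never


def drill_success_rate(drills: list, rto_seconds: float) -> float:
    """Fraction of drills that restored service within the recovery-time objective."""
    if not drills:
        raise ValueError("no drills recorded")
    ok = sum(
        1 for d in drills
        if d.restored_at is not None and d.restored_at - d.failed_at <= rto_seconds
    )
    return ok / len(drills)


def lag_breaches_rpo(last_primary_commit: float,
                     last_replicated_commit: float,
                     rpo_seconds: float) -> bool:
    """True when the standby trails the primary by more than the data-loss budget."""
    return (last_primary_commit - last_replicated_commit) > rpo_seconds


def active_nodes(roles: dict) -> list:
    """Names of nodes currently accepting writes; more than one means split-brain."""
    return sorted(name for name, role in roles.items() if role == "active")


# Usage with made-up drill data and a 30-second RTO.
drills = [Drill(0, 20), Drill(100, 145), Drill(200, None), Drill(300, 325)]
rate = drill_success_rate(drills, rto_seconds=30)
lag_alert = lag_breaches_rpo(1000.0, 990.0, rpo_seconds=5.0)
writers = active_nodes({"a": "standby", "b": "active"})
```

A drill that never restores service counts as a failure, so a silently broken failover path shows up in the success rate instead of being dropped from it.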

Related Tactics

  • Hot, warm, and cold spare: the standby-temperature spectrum above.
  • State Resynchronization: reconcile active and standby state before a promoted node goes live.
  • Voting / N-version redundancy: a different goal, masking wrong output for integrity rather than surviving a crash for availability.