Standby/Failover

Standby/Failover keeps a redundant copy of a component ready so that when the active instance fails, a standby is promoted and service continues. It is the workhorse availability tactic behind clustered databases, redundant controllers, and multi-zone deployments.

The variants differ only in how warm the standby runs. A hot spare processes the same inputs in parallel and fails over in milliseconds; a warm spare takes periodic state updates and recovers in seconds; a cold spare stays off until needed and recovers in minutes. Warmer means faster recovery at higher standing cost.

How It Works

A failure detector — heartbeat, health probe, or lease expiry — watches the active component.
On a missed signal it declares failure and triggers failover.
A standby is promoted to active; traffic, virtual IP, or leadership moves to it.
State reaches the standby by continuous replication (hot), periodic checkpoint (warm), or fresh start (cold).
The repaired component rejoins as the new standby.
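
The steps above can be sketched as a small heartbeat-driven controller. This is a minimal illustration, not a production design: the names `Node` and `FailoverController`, the three-second timeout, and the choice to promote the first standby in the list are all assumptions made for the example. Time is passed in explicitly so the behaviour is deterministic.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    role: str  # "active", "standby", or "failed"
    last_heartbeat: float = 0.0  # time of the most recent heartbeat, in seconds


@dataclass
class FailoverController:
    """Hypothetical controller: watches the active node and promotes a standby."""
    active: Node
    standbys: List[Node]
    timeout: float = 3.0  # assumed: seconds of silence before declaring failure
    failed: List[Node] = field(default_factory=list)

    def heartbeat(self, node: Node, now: float) -> None:
        """Record a liveness signal from a node."""
        node.last_heartbeat = now

    def check(self, now: float) -> Node:
        """Declare failure on a missed heartbeat and promote a standby."""
        if now - self.active.last_heartbeat <= self.timeout:
            return self.active  # still healthy, nothing to do
        if not self.standbys:
            raise RuntimeError("active failed and no standby is available")
        self.active.role = "failed"
        self.failed.append(self.active)
        promoted = self.standbys.pop(0)
        promoted.role = "active"
        promoted.last_heartbeat = now  # start the new active's liveness clock
        self.active = promoted
        return promoted

    def rejoin(self, node: Node, now: float) -> None:
        """A repaired node comes back as the new standby, never as active."""
        self.failed.remove(node)
        node.role = "standby"
        node.last_heartbeat = now
        self.standbys.append(node)
```

A real detector would also have to move traffic or leadership and guard against a false positive caused by a partition; this sketch covers only detection, promotion, and rejoin.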

Failure Modes

Split-brain: a partition leaves both nodes acting as active, accepting writes and diverging. A quorum or fencing token prevents it.
Stale standby: replication lags, so the promoted node serves old or inconsistent state.
Failover flapping: an unstable primary triggers repeated promotions, amplifying disruption.
Untested path: failover runs for the first time during a real outage and silently fails.
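
As an illustration of the fencing-token defence against split-brain, the sketch below has every promotion hand out a strictly increasing token, and the shared store refuses writes that carry an older one. `TokenIssuer` and `FencedStore` are hypothetical names. In a real system the issuer would itself need to be backed by a quorum or lock service, which is assumed away here.

```python
class TokenIssuer:
    """Hypothetical issuer: hands out a strictly increasing token per promotion."""

    def __init__(self) -> None:
        self._last = 0

    def promote(self) -> int:
        self._last += 1
        return self._last


class FencedStore:
    """Hypothetical shared store that rejects writes from a superseded leader."""

    def __init__(self) -> None:
        self.highest_token = 0
        self.data = {}

    def write(self, token: int, key: str, value: str) -> bool:
        if token < self.highest_token:
            return False  # stale leader: its write is fenced off
        self.highest_token = token
        self.data[key] = value
        return True
```

An old active that is cut off by a partition and still believes it is the leader keeps its old token, so its writes are rejected once the newly promoted node has written with a higher one.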

Verification

Kill or partition the active component and measure time to full service on the standby against the recovery-time objective.
After a partition, assert exactly one node accepts writes — no split-brain.
Track replication lag and alert when it exceeds the data-loss budget (RPO).
Run scheduled failover drills and record the success rate.
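
These checks reduce to a few small predicates that a drill harness could evaluate. The sketch below is one possible shape, with hypothetical function names and with timestamps and lag expressed in seconds; how those measurements are collected is left open.

```python
from typing import Mapping, Sequence


def meets_rto(failed_at: float, restored_at: float, rto_seconds: float) -> bool:
    """True if service came back on the standby within the recovery-time objective."""
    return (restored_at - failed_at) <= rto_seconds


def single_writer(accepts_writes: Mapping[str, bool]) -> str:
    """Return the one node accepting writes, or raise if there are zero or several."""
    writers = sorted(name for name, ok in accepts_writes.items() if ok)
    if len(writers) != 1:
        raise AssertionError(f"expected exactly one writer, found {writers}")
    return writers[0]


def lag_alert(lag_seconds: float, rpo_seconds: float) -> bool:
    """True if replication lag has exceeded the data-loss budget (RPO)."""
    return lag_seconds > rpo_seconds


def drill_success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of scheduled failover drills that succeeded."""
    if not outcomes:
        raise ValueError("no drills recorded")
    return sum(outcomes) / len(outcomes)
```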

Variants and Related Tactics

Hot, warm, and cold spare: the standby-temperature spectrum above.
State Resynchronization: reconcile active and standby state before a promoted node goes live.
Voting / N-version redundancy: a different goal, which is to mask wrong output for integrity rather than to survive a crash for availability.

Supported Qualities

Availability

A standby takes over when the active component fails, keeping the service reachable through the outage.

Fault tolerance

Losing one component does not stop the system, because a redundant peer continues the work.

Recoverability

Automated failover restores service within milliseconds to minutes, depending on how warm the standby runs, without waiting for the failed component to be repaired.

Trade-offs

Cost

Redundant capacity is paid for before any failure occurs. A hot spare that mirrors the active component roughly doubles the resource bill; colder standbys cut that cost but lengthen recovery time.

Consistency

Stateful failover relies on replicating state to the standby. A warm or cold spare can be promoted with stale or missing state, so in-flight work is dropped or replayed on takeover.

Maintainability

Failure detection, promotion logic, and split-brain prevention are machinery the team builds, tunes, and must exercise regularly, or failover quietly rots until the outage that needs it.

Related Requirements

Available 7x24 with 99% uptime

A ready standby keeps the system inside its uptime objective when the active component fails.
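
To make the objective concrete, the downtime budget implied by an uptime percentage can be worked out directly. The helper name below is hypothetical, and the year is taken as 365 days.

```python
def downtime_budget_hours(availability: float, period_hours: float) -> float:
    """Hours of downtime allowed in a period at the given availability (0.0 to 1.0)."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * period_hours


HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years

yearly = downtime_budget_hours(0.99, HOURS_PER_YEAR)  # roughly 87.6 hours
monthly = downtime_budget_hours(0.99, 30 * 24)        # roughly 7.2 hours
```

At 99% the budget is generous, so the standby mainly ensures that a single long repair does not consume it in one incident.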

Server fails, system continues to operate without downtime

Promoting a standby on server failure is exactly the mechanism this requirement demands.

Zone failure: no service interruption

A standby in a second zone absorbs a whole-zone outage without interrupting service.

Unavailable for max 2 minutes

The recovery-time budget sets the standby temperature: a two-minute ceiling rules out any cold spare that cannot boot and take over within that window.
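
One way to apply this rule is to pick the coldest, and therefore cheapest, standby whose recovery time still fits the budget. The recovery times in the sketch are illustrative assumptions only, and the function name is hypothetical; real figures should come from failover drills.

```python
# Assumed, illustrative recovery times in seconds; replace with measured values.
RECOVERY_SECONDS = {"cold": 600.0, "warm": 30.0, "hot": 0.5}
COLDEST_FIRST = ["cold", "warm", "hot"]


def coldest_spare_within(rto_seconds: float, recovery=RECOVERY_SECONDS) -> str:
    """Return the coldest standby type whose recovery time fits the budget."""
    for kind in COLDEST_FIRST:
        if recovery[kind] <= rto_seconds:
            return kind
    raise ValueError("no standby temperature meets this recovery-time budget")
```

With these assumed figures, a 120-second budget selects the warm spare, because the cold spare's ten-minute start-up exceeds it.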

Related Approaches

Data Replication

Failover provides a spare for the compute component; replication copies the state that spare needs. A stateful standby is only useful once data is replicated to it, so the two tactics are usually deployed together.

N-Version Redundancy and Voting

Both run redundant copies, but against different faults: identical standbys survive independent hardware or crash failures, while diverse N-version replicas survive the common-mode design faults that identical copies would all share.

Quarantine

Complementary — quarantine removes the unstable instance; failover promotes a standby to cover the gap, so capacity holds while the detached component is diagnosed.
