Fail-safe defaults ensure that when something unexpected happens, the system falls into a known-safe state rather than an open, permissive, or undefined one. The principle applies at every level: hardware that de-energizes to a safe position on power loss, software that denies access when an authorization check fails to complete, and configuration that disables dangerous features when a setting is missing.

The idea was formalized by Saltzer and Schroeder in 1975 as one of eight design principles for information protection, where it meant basing access decisions on permission rather than exclusion. In safety engineering, the same concept appears as the de-energize-to-safe principle: a valve closes, a brake engages, a process halts — not because an active command says so, but because the absence of a valid active signal defaults to the safe position.

How It Works

  • Define, for every component and operational mode, what the safe state is — the configuration that minimizes harm when the system cannot determine the correct action.
  • Treat all unrecognized or missing inputs as invalid by default: deny access, reject commands, disable features, halt motion.
  • On detecting an anomaly that exceeds defined tolerance (watchdog timeout, invariant violation, sensor out of range), transition the component to its safe state and raise an alert.
  • Make the safe-state transition atomic where possible — partial transitions can leave the system in a state that is neither operational nor safe.
  • Document each safe state explicitly so operators know what to expect after a transition and what recovery steps are required.
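The steps above can be sketched as a minimal controller. This is an illustrative assumption, not any particular framework's API: the names (`SafeStateController`, `VALID_COMMANDS`, the 500 ms watchdog budget) and the valve-like "closed" safe state are invented for the example.

```python
import time

# Sketch of a component with an explicit, documented safe state.
# All names and thresholds here are hypothetical.

VALID_COMMANDS = {"open", "close", "hold"}
SAFE_STATE = "closed"          # documented safe state for this component
WATCHDOG_TIMEOUT_S = 0.5       # anomaly tolerance: heartbeat deadline

class SafeStateController:
    def __init__(self):
        self.state = SAFE_STATE            # start safe, not merely "off"
        self.last_heartbeat = time.monotonic()
        self.alerts = []

    def handle_command(self, command):
        # Treat unrecognized input as invalid by default: reject and go safe.
        if command not in VALID_COMMANDS:
            self.enter_safe_state(f"invalid command: {command!r}")
            return False
        if command != "hold":
            self.state = command
        return True

    def heartbeat(self):
        self.last_heartbeat = time.monotonic()

    def check_watchdog(self):
        # Absence of a valid active signal defaults to the safe position.
        if time.monotonic() - self.last_heartbeat > WATCHDOG_TIMEOUT_S:
            self.enter_safe_state("watchdog timeout")

    def enter_safe_state(self, reason):
        # A single assignment keeps the transition atomic at this level;
        # the alert is raised only after the state is already safe.
        self.state = SAFE_STATE
        self.alerts.append(reason)
```

Note that the controller starts in the safe state and returns to it on any unrecognized command or missed heartbeat; the `alerts` list stands in for whatever alerting channel the real system uses.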

Failure Modes

  • A safe state defined too conservatively triggers on benign transients, causing unnecessary shutdowns and eroding operator trust (nuisance trips).
  • A safe state defined too loosely allows genuinely hazardous conditions to persist because the threshold for transition is never reached.
  • Partial transitions leave the system in a state that is neither safe nor operational — for example, one actuator halted while another continues.
  • Missing safe-state definitions for newly added components or modes: developers add a feature but forget to define what happens when it fails.
  • Operators override or disable fail-safe logic after repeated nuisance trips, removing the safety net entirely.
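One common mitigation for nuisance trips that does not loosen the safe-state threshold is to require the out-of-range condition to persist across several consecutive samples before tripping. A minimal debounce sketch, with illustrative limits and sample counts:

```python
# Debounce sketch: trip only when a reading stays out of range for
# TRIP_AFTER consecutive samples, so a benign one-sample transient does
# not cause a nuisance trip. Range and count are assumed for the example.

LOW, HIGH = 10.0, 90.0      # hypothetical valid sensor range
TRIP_AFTER = 3              # consecutive out-of-range samples before trip

class DebouncedTrip:
    def __init__(self):
        self.out_of_range_count = 0
        self.tripped = False

    def sample(self, reading):
        if LOW <= reading <= HIGH:
            self.out_of_range_count = 0      # transient cleared
        else:
            self.out_of_range_count += 1
            if self.out_of_range_count >= TRIP_AFTER:
                self.tripped = True          # persistent hazard: go safe
        return self.tripped
```

The trade-off is explicit: a larger `TRIP_AFTER` reduces nuisance trips but delays the safe-state transition for genuine hazards, so the count must fit inside the system's fault-tolerance time.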

Verification

  • Fault injection: for each defined anomaly class (watchdog timeout, invalid input, missing config, sensor out of range), inject the condition and verify the system reaches the documented safe state within the specified time (for example < 500 ms for a safety controller).
  • Completeness audit: confirm that every component and operational mode has a documented safe state; flag any path that can reach an undefined state as a gap.
  • Atomicity check: interrupt the safe-state transition at various points (power cut, process kill) and verify the system does not settle in a half-transitioned state.
  • Recovery round-trip: after a safe-state transition, execute the documented recovery procedure and verify the system returns to normal operation within the agreed window.
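A fault-injection harness along the lines of the first bullet might look like the following. The controller stub, the `inject_fault` interface, and the 500 ms budget are all assumptions for the sketch; a real harness would inject each condition externally (drop the heartbeat, delete the config key) rather than call a method.

```python
import time

# Fault-injection sketch: for each anomaly class, inject the condition
# and verify the documented safe state is reached within the deadline.

SAFE_STATE = "halted"
DEADLINE_S = 0.5               # assumed safe-state transition budget

class ControllerStub:
    def __init__(self):
        self.state = "running"

    def inject_fault(self, anomaly):
        # Stand-in for external injection; the stub reacts directly.
        self.state = SAFE_STATE

def verify_safe_transition(anomaly):
    ctrl = ControllerStub()
    start = time.monotonic()
    ctrl.inject_fault(anomaly)
    elapsed = time.monotonic() - start
    return ctrl.state == SAFE_STATE and elapsed < DEADLINE_S

ANOMALIES = ["watchdog_timeout", "invalid_input",
             "missing_config", "sensor_out_of_range"]
results = {a: verify_safe_transition(a) for a in ANOMALIES}
```

Enumerating the anomaly classes as data, as in `ANOMALIES`, doubles as a completeness audit: a newly added anomaly class that is missing from the list is visible as a gap.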

Related Patterns

  • Deny-by-default access control applies fail-safe defaults to authorization: any request not explicitly permitted is denied.
  • Watchdog supervision uses a periodic heartbeat; absence of the heartbeat triggers the safe-state transition.
  • Dead man’s switch requires continuous active confirmation to remain in an operational mode — releasing the switch defaults to safe.
  • Circuit breaker implements a fail-safe default for remote dependencies: when failures exceed a threshold, the breaker opens and returns a safe fallback rather than forwarding requests.
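The circuit breaker in the last bullet can be sketched as follows. The threshold, the fallback value, and the `call` interface are illustrative assumptions, and the sketch omits the half-open recovery state that production breakers add.

```python
# Circuit-breaker sketch: after FAILURE_THRESHOLD consecutive failures
# the breaker opens and returns a safe fallback instead of forwarding
# the request. (No half-open/recovery state in this sketch.)

FAILURE_THRESHOLD = 3

class CircuitBreaker:
    def __init__(self, fallback):
        self.fallback = fallback
        self.failures = 0
        self.open = False

    def call(self, fn, *args):
        if self.open:
            return self.fallback       # fail-safe default: stop forwarding
        try:
            result = fn(*args)
            self.failures = 0          # success resets the count
            return result
        except Exception:
            self.failures += 1
            if self.failures >= FAILURE_THRESHOLD:
                self.open = True
            return self.fallback
```

The fail-safe property is the open state itself: once the failure threshold is exceeded, the breaker returns the fallback without forwarding, so a misbehaving dependency cannot keep absorbing requests.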

References