Watchdog Supervision

Watchdog supervision places an independent monitor alongside a supervised component. The monitor expects a periodic signal — a heartbeat, a liveness-probe response, or a timer reset — and if that signal stops arriving within a configured timeout, the monitor concludes the component is hung and takes corrective action: restart the process, fail over to a standby, or transition to a safe state.

The pattern is fundamental to both safety-critical systems (hardware watchdog timers in embedded controllers that cut power or engage brakes on timeout) and cloud-native infrastructure (Kubernetes liveness probes that restart unresponsive pods). The key property is independence: the monitor runs in a separate failure domain from the supervised component, so it can act even when the component is completely unresponsive.

How It Works

The supervised component periodically sends a heartbeat or resets a watchdog timer to prove it is alive and making progress.
The supervisor (a separate process, sidecar, orchestrator, or hardware timer) monitors the signal and restarts its countdown each time the signal arrives.
If the countdown expires without a signal (missed heartbeat), the supervisor declares the component unresponsive and initiates the configured action: process restart, container replacement, failover to a replica, or safe-state transition.
Before forcing the restart, the supervisor optionally captures diagnostic state (thread dump, heap snapshot, last log lines) to support root-cause analysis.
After restart, the supervisor monitors the component through a startup grace period before resuming normal heartbeat checks, avoiding restart loops during slow initialization.
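
The steps above can be sketched as a small passive monitor. This is an illustration under stated assumptions, not a production supervisor: the `Watchdog` class, its parameter names, and the injectable `clock` are hypothetical and exist only to make the flow concrete (heartbeat, countdown, diagnostic capture, corrective action, startup grace period).

```python
import time
from typing import Callable, Optional


class Watchdog:
    """Passive heartbeat monitor (hypothetical sketch).

    The supervised component calls beat(); the supervisor calls check()
    on its own schedule, ideally from a separate failure domain.
    """

    def __init__(self, timeout_s: float, grace_s: float,
                 on_expire: Callable[[], None],
                 capture_diagnostics: Optional[Callable[[], None]] = None,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_s = timeout_s
        self.grace_s = grace_s
        self.on_expire = on_expire
        self.capture_diagnostics = capture_diagnostics
        self.clock = clock
        # The first deadline covers the startup grace period.
        self.deadline = clock() + grace_s

    def beat(self) -> None:
        """Heartbeat from the component: restart the countdown."""
        self.deadline = self.clock() + self.timeout_s

    def check(self) -> bool:
        """Return True if the countdown expired and the action was taken."""
        if self.clock() < self.deadline:
            return False
        if self.capture_diagnostics is not None:
            self.capture_diagnostics()  # preserve evidence before the restart
        self.on_expire()                # restart, failover, or safe state
        # The restarted component gets a fresh grace period.
        self.deadline = self.clock() + self.grace_s
        return True
```

In this sketch the first heartbeat after a restart ends the grace period, so a component should start sending heartbeats only once initialization is complete. Using a monotonic clock keeps wall-clock adjustments from triggering or suppressing a timeout.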

Failure Modes

Timeout too short relative to normal processing variance (GC pauses, I/O bursts): late heartbeats are mistaken for hangs, causing false-positive restarts that reduce availability instead of improving it.
Restart loops: the supervised component fails immediately after restart, the watchdog restarts it again, and the cycle repeats without recovery because no escalation strategy (backoff, circuit breaker, alert) is in place.
The watchdog itself fails or hangs, leaving the supervised component unmonitored — the monitor must be simpler and more reliable than the component it supervises.
Restarts without pre-restart diagnostic capture destroy the evidence needed to find the root cause, turning a recoverable hang into a recurring mystery.
Shared-fate deployment: the watchdog runs on the same host or in the same process as the supervised component, so a host-level failure takes out both.
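
One way to guard against the restart-loop failure mode is a restart budget with exponential backoff. The sketch below is an assumption-laden illustration: `RestartPolicy`, its defaults (3 attempts per 300 s window, 1 s base delay), and the `"restart"` / `"escalate"` return values are hypothetical names chosen here, not part of any particular supervisor.

```python
from collections import deque
from typing import Deque, Tuple


class RestartPolicy:
    """Sliding-window restart budget with exponential backoff (sketch)."""

    def __init__(self, max_attempts: int = 3, window_s: float = 300.0,
                 base_delay_s: float = 1.0, max_delay_s: float = 60.0) -> None:
        self.max_attempts = max_attempts
        self.window_s = window_s
        self.base_delay_s = base_delay_s
        self.max_delay_s = max_delay_s
        self.history: Deque[float] = deque()  # timestamps of recent restarts

    def next_action(self, now: float) -> Tuple[str, float]:
        """Return ("restart", delay_s) or ("escalate", 0.0)."""
        # Forget restarts that have aged out of the window.
        while self.history and now - self.history[0] >= self.window_s:
            self.history.popleft()
        if len(self.history) >= self.max_attempts:
            # Budget exhausted: stop restarting and alert a human or
            # a higher-level supervisor instead.
            return ("escalate", 0.0)
        # Delay doubles with each restart still inside the window.
        delay = min(self.base_delay_s * 2 ** len(self.history),
                    self.max_delay_s)
        self.history.append(now)
        return ("restart", delay)
```

The watchdog would consult the policy each time its timeout fires; an `"escalate"` result means restarting at this level has stopped helping and the problem should be surfaced rather than retried.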

Verification

Hang injection: pause the supervised component (for example SIGSTOP on Linux, simulated deadline miss on an embedded controller) and verify the watchdog detects the hang and triggers restart within the configured timeout plus a tolerance margin (for example timeout 30 s + margin 5 s).
False-positive test: under peak load with realistic GC or I/O latency, verify zero spurious restarts over a representative test window (for example 24 hours of load-test traffic).
Restart-loop protection: force the supervised component to crash immediately after startup and verify the watchdog applies backoff or stops restarting after the configured attempt limit (for example 3 attempts within 5 minutes) and escalates to an alert.
Diagnostic capture: after a watchdog-triggered restart, verify that pre-restart diagnostics (thread dump, last N log lines) are persisted and accessible.
Independence validation: kill the host or VM running the supervised component and verify the watchdog (running in a separate failure domain) still detects the failure and triggers failover.
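
The pass criterion for the hang-injection test can be made explicit in a small checker. The following is a sketch under assumptions: `HangTrial`, `verdict`, and the verdict strings are hypothetical, and both timestamps are assumed to come from the same monotonic clock in the test harness.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HangTrial:
    last_heartbeat_s: float      # last heartbeat seen before the pause
    restart_s: Optional[float]   # watchdog-triggered restart, None if never


def verdict(trial: HangTrial, timeout_s: float = 30.0,
            margin_s: float = 5.0) -> str:
    """Classify one hang-injection trial against timeout + margin."""
    if trial.restart_s is None:
        return "missed"          # the watchdog never fired
    latency = trial.restart_s - trial.last_heartbeat_s
    if latency < timeout_s:
        return "premature"       # fired before the configured timeout
    if latency > timeout_s + margin_s:
        return "late"            # detection exceeded the tolerance margin
    return "pass"
```

With the example values from the hang-injection item (timeout 30 s, margin 5 s), a restart observed 32 s after the last heartbeat passes, while one observed after 40 s is reported as late.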

Variants and Related Tactics

Hardware watchdog timers in embedded and safety-critical systems: software must periodically reset a physical timer; expiry triggers a hardware-level reset or safe-state relay.
Kubernetes liveness probes are software watchdogs: the kubelet periodically checks an HTTP endpoint, TCP socket, or exec command and restarts the container on repeated failures.
Erlang/OTP supervisors implement hierarchical watchdog trees with configurable restart strategies (one-for-one, one-for-all, rest-for-one).
Heartbeat / ping-echo probes are the active variant where the supervisor sends a probe and expects a response, rather than passively waiting for a signal from the supervised component.
Escalating restart increases the scope of the restart (process → container → node → zone) if repeated restarts at the current level do not restore health.
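
The active (ping-echo) variant, combined with acting only on repeated failures, might look like the sketch below. It is not the kubelet's implementation: `ActiveProbe`, `failure_threshold`, and `tick` are hypothetical names, and the probe callable stands in for an HTTP, TCP, or exec check.

```python
from typing import Callable


class ActiveProbe:
    """Supervisor-initiated probe with a consecutive-failure threshold."""

    def __init__(self, probe: Callable[[], bool],
                 failure_threshold: int = 3) -> None:
        self.probe = probe
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def tick(self) -> bool:
        """Run one probe; return True when a restart should be triggered."""
        try:
            healthy = self.probe()
        except Exception:
            healthy = False  # a probe that raises counts as a failure
        if healthy:
            self.consecutive_failures = 0  # any success resets the count
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.consecutive_failures = 0  # start fresh after the restart
            return True
        return False
```

Requiring several consecutive failures trades detection latency for fewer false positives: a single slow response does not trigger a restart, but a hang is detected only after the threshold multiplied by the probe period.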

References

IEC 61508: Functional Safety of Electrical/Electronic/Programmable Electronic Safety-related Systems. Covers watchdog timers as a diagnostic measure for safety-related systems.
Software Architecture in Practice — Bass, Clements & Kazman (full citation)
