Context

A business-critical online service must remain available during infrastructure failures.

Trigger

One complete availability zone (or equivalent independent failure domain) becomes unavailable.

Acceptance Criteria

  • The production deployment provides N+1 capacity for stateless service workloads across at least 2 independent failure domains.
  • Loss of one failure domain causes no user-visible outage for core functions.
  • After failover stabilization, sustained throughput remains at least 95% of pre-failure baseline.
  • At the 95th percentile, end-user response time degrades by no more than 20% relative to the pre-failure baseline.
  • Automatic failover and recovery actions complete within 60 seconds in quarterly resilience drills.
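
The numeric criteria above can be checked mechanically after a drill. The following is a minimal sketch, not a prescribed implementation: the names `DrillResult`, `instances_per_domain`, and `evaluate` are hypothetical, and it assumes that "N+1" means the surviving failure domains must carry the full peak load after one domain is lost, with instances spread evenly across domains.

```python
import math
from dataclasses import dataclass


@dataclass
class DrillResult:
    """Measurements from one resilience drill (hypothetical field names)."""
    baseline_rps: float          # sustained throughput before the failure
    post_failover_rps: float     # sustained throughput after stabilization
    baseline_p95_ms: float       # p95 response time before the failure
    post_failover_p95_ms: float  # p95 response time after the failure
    failover_seconds: float      # time to complete failover and recovery
    core_outage_seconds: float   # user-visible outage of core functions


def instances_per_domain(peak_instances: int, domains: int) -> int:
    """Instances each domain must host so that the remaining domains
    still cover peak load after losing one (assumed reading of N+1)."""
    if domains < 2:
        raise ValueError("at least 2 independent failure domains required")
    return math.ceil(peak_instances / (domains - 1))


def evaluate(r: DrillResult) -> dict[str, bool]:
    """Return a pass/fail flag for each measurable acceptance criterion."""
    return {
        "no_core_outage": r.core_outage_seconds == 0,
        "throughput_ge_95pct": r.post_failover_rps >= 0.95 * r.baseline_rps,
        "p95_within_20pct": r.post_failover_p95_ms <= 1.20 * r.baseline_p95_ms,
        "failover_within_60s": r.failover_seconds <= 60,
    }
```

For example, under the stated assumption a service that needs 12 instances at peak and runs in 3 failure domains would provision `instances_per_domain(12, 3)`, i.e. 6 instances per domain, so that the 2 surviving domains still provide 12. A drill passes only if every flag returned by `evaluate` is true.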