Context

The system depends on multiple upstream services (payments, profiles, notifications). Transient upstream failures must not cascade or break core user flows.

Trigger

Upstream service experiences transient failures or degraded performance during normal system operation.

Acceptance Criteria

  • Implement circuit breakers with automatic open/half‑open/close states and exponential backoff
    • Sliding window: 10–60s time window or 20–200 requests minimum sample size (use both where supported)
    • Failure‑rate threshold: 20–50% (default 50%); require min 20 requests before evaluation
    • Cool‑down (open state): 10–60s; half‑open probes: 1–5 concurrent trial requests; close after 5–10 consecutive successes
    • Timeouts per dependency: 200–1500ms typical; retries: 0–2 with jittered exponential backoff (cap 2–5s)
  • Dependency classification and strategies:
    • Critical dependencies (payments, auth): prefer graceful degradation with cached/queued fallbacks; stricter thresholds (e.g., 20–30% failure trip), shorter timeouts, fewer retries
    • Non‑critical (recommendations, analytics): fail‑silent with placeholders or skip; looser thresholds (e.g., 40–50%), longer cool‑downs
    • Each dependency must declare timeout, retry policy, trip threshold, and fallback behavior in config
  • Recovery testing:
    • Exercise half‑open → closed transitions in tests by restoring upstream health; require ≥5 consecutive probe successes to close; ensure queued work drains without user‑visible errors
    • Verify state resets do not cause thundering herds (limit probe concurrency; retain backoff until closed)
  • When a breaker opens, the user experience degrades gracefully (e.g., hide recommendations, queue notifications) while core operations succeed; display a neutral placeholder, never a stack trace
  • All client calls are idempotent (PUT/DELETE with idempotency keys; POST with dedup keys) to allow safe retries
  • Latency and errors during failure injection:
    • Median end‑user latency increases by ≤20%; p95 ≤ 2× baseline, p99 ≤ 3× baseline
    • Edge error rate (5xx/timeouts) ≤ 0.5% over any rolling 10‑minute window
  • Observability (must track and dashboard):
    • Per‑dependency p50/p95/p99 latency; success/error rates by error type (timeout, connect error, 5xx, 4xx policy)
    • Breaker state transitions, time in state, open/close counts, half‑open probe success rate
    • Retry counts, backoff/cancel rates, queue depth, and client/thread‑pool saturation
  • Validate via chaos exercises at least quarterly (inject 5xx, timeouts, and 2× latency), demonstrating compliance with thresholds and successful recovery to closed state