Context/Background

The system depends on multiple upstream services (payments, profiles, notifications). Transient upstream failures must not cascade or break core user flows.

Metric/Acceptance Criteria

  • Implement circuit breakers with automatic open/half‑open/close states and exponential backoff.
    • Sliding window: 10–60s time window or 20–200 requests minimum sample size (use both where supported).
    • Failure‑rate threshold: 20–50% (default 50%); require min 20 requests before evaluation.
    • Cool‑down (open state): 10–60s; half‑open probes: 1–5 concurrent trial requests; close after 5–10 consecutive successes.
    • Timeouts per dependency: 200–1500ms typical; retries: 0–2 with jittered exponential backoff (cap 2–5s).
  • Dependency classification and strategies:
    • Critical dependencies (payments, auth): prefer graceful degradation with cached/queued fallbacks; stricter thresholds (e.g., 20–30% failure trip), shorter timeouts, fewer retries.
    • Non‑critical (recommendations, analytics): fail‑silent with placeholders or skip; looser thresholds (e.g., 40–50%), longer cool‑downs.
    • Each dependency must declare timeout, retry policy, trip threshold, and fallback behavior in config.
  • Recovery testing:
    • Exercise half‑open → closed transitions in tests by restoring upstream health; require ≥5 consecutive probe successes to close; ensure queued work drains without user‑visible errors.
    • Verify state resets do not cause thundering herds (limit probe concurrency; retain backoff until closed).
  • When a breaker opens, the user experience degrades gracefully (e.g., hide recommendations, queue notifications) while core operations succeed; display a neutral placeholder, never a stack trace.
  • All client calls are idempotent (PUT/DELETE with idempotency keys; POST with dedup keys) to allow safe retries.
  • Latency and errors during failure injection:
    • Median end‑user latency increases by ≤20%; p95 ≤ 2× baseline, p99 ≤ 3× baseline.
    • Edge error rate (5xx/timeouts) ≤ 0.5% over any rolling 10‑minute window.
  • Observability (must track and dashboard):
    • Per‑dependency p50/p95/p99 latency; success/error rates by error type (timeout, connect error, 5xx, 4xx policy).
    • Breaker state transitions, time in state, open/close counts, half‑open probe success rate.
    • Retry counts, backoff/cancel rates, queue depth, and client/thread‑pool saturation.
  • Validate via chaos exercises at least quarterly (inject 5xx, timeouts, and 2× latency), demonstrating compliance with thresholds and successful recovery to closed state.