Context
An e-commerce platform runs as a distributed system with multiple services. Operations engineers must detect and assess production anomalies quickly — without accessing source code or redeploying instrumentation.
Trigger
A production anomaly occurs (e.g., elevated error rate, latency spike, resource exhaustion, or unexpected traffic pattern).
Acceptance Criteria
- The anomaly is visible on operational dashboards within 2 minutes of onset
- All services emit structured logs (JSON) with at minimum: timestamp, severity, service name, trace ID, and error category
- All inter-service calls are captured as distributed traces with end-to-end latency breakdown per hop
- Key runtime metrics (request rate, error rate, p95 latency, CPU, memory) are collected at ≤ 30-second granularity
- Dashboards allow drill-down from system-wide overview to individual service metrics without custom queries
- Alerting rules trigger automated notifications within 90 seconds of threshold breach