A saga coordinates a business transaction spanning several services, each owning its private datastore. Instead of one distributed ACID transaction, it runs a sequence of local transactions; when a step fails, compensating actions undo the completed steps in reverse order — cancel, refund, release. The saga ends completed or compensated, never ACID-rolled-back: audit records and already-observed intermediate states remain.
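
The control flow described above can be sketched in a few lines. This is a minimal in-process illustration, not a production coordinator: the `Step` type, `run_saga`, and the order-flow step names are hypothetical, each "local transaction" is a plain function call, and nothing is persisted.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], None]       # local transaction in one service
    compensate: Callable[[], None]   # semantic undo of that transaction


def run_saga(steps: List[Step]) -> str:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    done: List[Step] = []
    for step in steps:
        try:
            step.action()
            done.append(step)
        except Exception:
            # The failed step is assumed to have rolled back locally,
            # so only the steps that committed are compensated.
            for finished in reversed(done):
                finished.compensate()
            return "compensated"
    return "completed"


log: List[str] = []


def fail_payment() -> None:
    raise RuntimeError("card declined")


order_saga = [
    Step("reserve_stock", lambda: log.append("reserve"), lambda: log.append("release")),
    Step("create_shipment", lambda: log.append("ship"), lambda: log.append("cancel")),
    Step("charge_card", fail_payment, lambda: log.append("refund")),
]

outcome = run_saga(order_saga)
print(outcome, log)  # compensated ['reserve', 'ship', 'cancel', 'release']
```

The log keeps the forward entries next to the compensations, which mirrors the point that a compensated saga leaves a trace rather than erasing history.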

How It Works
- Classify each step: compensatable (has a compensating action), the pivot (once it commits, the saga must run forward), and retriable (post-pivot steps designed so retries eventually succeed).
- Coordinate by choreography (services react to each other’s events; flows get hard to trace) or orchestration (a coordinator sends commands, tracks state, triggers compensation).
- Persist progress durably before acting; a crashed coordinator then resumes or compensates.
- Make all steps and compensations idempotent (brokers redeliver), and publish events atomically with the local commit via an outbox or change data capture.
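
The last bullet, idempotent handling plus an outbox, might look like the sketch below. It assumes a single participant backed by SQLite; the table names, `handle_reserve`, and `relay_outbox` are hypothetical, and the `publish` callable stands in for a real broker client.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE processed (message_id TEXT PRIMARY KEY);
    CREATE TABLE reservations (order_id TEXT PRIMARY KEY, qty INTEGER);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT);
""")


def handle_reserve(message_id: str, order_id: str, qty: int) -> bool:
    """Apply a reserve command once; return False for a redelivery."""
    with db:  # one local transaction: commit on success, roll back on error
        cur = db.execute(
            "INSERT OR IGNORE INTO processed VALUES (?)", (message_id,)
        )
        if cur.rowcount == 0:
            return False  # message_id already recorded: skip side effects
        db.execute("INSERT INTO reservations VALUES (?, ?)", (order_id, qty))
        event = {"type": "StockReserved", "order_id": order_id}
        # The event row commits atomically with the state change above.
        db.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(event),))
    return True


def relay_outbox(publish) -> int:
    """Publish pending events; delete each row only after publish returns."""
    rows = db.execute("SELECT id, payload FROM outbox ORDER BY id").fetchall()
    for row_id, payload in rows:
        publish(json.loads(payload))  # if this raises, the row stays for retry
        with db:
            db.execute("DELETE FROM outbox WHERE id = ?", (row_id,))
    return len(rows)


published = []
first = handle_reserve("msg-1", "order-42", 3)
second = handle_reserve("msg-1", "order-42", 3)  # simulated broker redelivery
relay_outbox(published.append)
```

A crash between `publish` and the `DELETE` republishes the event on the next relay pass, so delivery is at-least-once; that is the reason consumers need the same deduplication the handler uses.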

Failure Modes
- Compensation failure — a compensating action itself fails; the coordinator retries, permanent failures page an operator.
- Lack of isolation — concurrent transactions see intermediate state that compensation later reverts; counter with semantic locks, commutative updates, a pessimistic view (reordering steps to shrink the exposure window), or re-reading a value before overwriting it.
- Sequencing errors — an irreversible action (a payout, an email) placed before the pivot cannot be retracted; make it the pivot or a retriable step.
- Lost publish — a participant commits but crashes before publishing; without the outbox the saga stalls silently.
- Timeout race — a timed-out step succeeds late after compensation: a charge landing after its refund.
- Stalled sagas — a lost reply or crashed participant leaves a saga waiting indefinitely; per-step timeouts and an alert on the age of the oldest incomplete saga bound the time spent in limbo.
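
One possible guard against the timeout race is for the participant to remember cancelled sagas, so a forward command that arrives late is rejected instead of applied. The sketch below is an assumption-laden illustration: `PaymentService`, its method names, and the returned status strings are hypothetical, and state lives in memory rather than in the service's datastore.

```python
from enum import Enum
from typing import Dict, List


class ChargeState(Enum):
    CHARGED = "charged"
    CANCELLED = "cancelled"


class PaymentService:
    """Hypothetical participant that tracks one charge per saga id."""

    def __init__(self) -> None:
        self.charges: Dict[str, ChargeState] = {}
        self.captured: List[str] = []
        self.refunded: List[str] = []

    def charge(self, saga_id: str) -> str:
        state = self.charges.get(saga_id)
        if state is ChargeState.CANCELLED:
            return "rejected"   # tombstone: the saga was already compensated
        if state is ChargeState.CHARGED:
            return "duplicate"  # redelivery: do not charge twice
        self.charges[saga_id] = ChargeState.CHARGED
        self.captured.append(saga_id)
        return "charged"

    def cancel(self, saga_id: str) -> str:
        state = self.charges.get(saga_id)
        if state is ChargeState.CANCELLED:
            return "duplicate"
        if state is ChargeState.CHARGED:
            self.refunded.append(saga_id)
            result = "refunded"
        else:
            result = "tombstoned"  # nothing to undo yet; block a late charge
        self.charges[saga_id] = ChargeState.CANCELLED
        return result


svc = PaymentService()
# Timeout race: the coordinator gave up and compensated, then the charge arrives.
late_cancel = svc.cancel("saga-7")
late_charge = svc.charge("saga-7")
# Normal order for comparison.
normal = [svc.charge("saga-8"), svc.cancel("saga-8"), svc.cancel("saga-8")]
```

This only helps when the participant controls the side effect. If the charge is executed by an external provider that cannot be told to reject it, a reconciliation job that issues the refund after the fact is still needed.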

Verification
- Inject failures at every step; the saga reaches completed or compensated within the per-flow SLA.
- Replay every message; side effects occur exactly once.
- Deliver a compensating event before its forward event; state stays correct.
- Cross-service reconciliation reports zero unexplained discrepancies.
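
The first check, injecting a failure at every step, can be prototyped against an in-memory model before it is wired to real services. Everything in this sketch is hypothetical (the three-step flow, the `State` dictionary, the `fail_at` hook), and it covers only the terminal state and leftover business state, not the SLA timing or message replay.

```python
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, int]
Step = Tuple[Callable[[State], None], Callable[[State], None]]  # (action, compensation)


def reserve(s: State) -> None: s["stock"] -= 1
def release(s: State) -> None: s["stock"] += 1
def charge(s: State) -> None: s["balance"] -= 10
def refund(s: State) -> None: s["balance"] += 10
def ship(s: State) -> None: s["shipped"] += 1
def unship(s: State) -> None: s["shipped"] -= 1


STEPS: List[Step] = [(reserve, release), (charge, refund), (ship, unship)]


def run_saga(state: State, fail_at: Optional[int] = None) -> str:
    """Run STEPS, simulating a failure just before step index `fail_at`."""
    done: List[Step] = []
    for index, step in enumerate(STEPS):
        if index == fail_at:
            for _, compensate in reversed(done):
                compensate(state)
            return "compensated"
        step[0](state)
        done.append(step)
    return "completed"


def check_all_failure_points() -> List[str]:
    """Fail at each step in turn and verify no business state is left behind."""
    initial = {"stock": 5, "balance": 100, "shipped": 0}
    outcomes = []
    for fail_at in [None, *range(len(STEPS))]:
        state = dict(initial)
        outcome = run_saga(state, fail_at)
        if outcome == "compensated":
            assert state == initial, f"residue after failing step {fail_at}: {state}"
        outcomes.append(outcome)
    return outcomes


results = check_all_failure_points()
print(results)  # ['completed', 'compensated', 'compensated', 'compensated']
```

Comparing against the initial state is a simplification: in a real system, audit records and notifications legitimately remain after compensation, so the assertion would compare only the balances and quantities that compensation is meant to restore.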

Variants and Related Tactics
- Asynchronous Messaging — supplies delivery, retry, and queueing for both coordination styles.
- Two-Phase Commit — holds locks until the coordinated commit, blocks in-doubt participants, scales poorly.

References
- Sagas — Garcia-Molina & Salem, SIGMOD 1987
- Microservices Patterns: With examples in Java — Chris Richardson, Manning, 2018