A saga coordinates a business transaction spanning several services, each owning its private datastore. Instead of one distributed ACID transaction, it runs a sequence of local transactions; when a step fails, compensating actions undo the completed steps in reverse order — cancel, refund, release. The saga ends completed or compensated, never ACID-rolled-back: audit records and already-observed intermediate states remain.

Saga flow: four local transactions run in sequence; the last one fails, so compensating transactions undo the three completed steps in reverse order, ending fully undone.
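
The flow above can be sketched in a few lines. This is a minimal illustration, not a production coordinator: the `Step` type, the `run_saga` function, and the step names are hypothetical, and each "local transaction" is just a callable that either returns or raises.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], None]      # the local transaction
    compensate: Callable[[], None]  # its semantic undo


def run_saga(steps: List[Step]) -> str:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    done: List[Step] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            for finished in reversed(done):
                finished.compensate()
            return "compensated"
        done.append(step)
    return "completed"


def demo() -> tuple:
    """Four steps; the fourth fails, so the first three are undone."""
    log: List[str] = []

    def record(msg: str) -> Callable[[], None]:
        return lambda: log.append(msg)

    def fail() -> None:
        raise RuntimeError("step four failed")

    steps = [
        Step("reserve_stock", record("reserve"), record("release")),
        Step("charge_card", record("charge"), record("refund")),
        Step("create_shipment", record("ship"), record("cancel")),
        Step("confirm_order", fail, record("unused")),
    ]
    return run_saga(steps), log
```

Calling `demo()` returns the outcome `"compensated"` together with a log in which the three compensations appear in the reverse of the order their forward steps ran. The forward entries stay in the log, which mirrors the point that a saga is compensated rather than rolled back.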

How It Works

  • Classify each step: compensatable (has a compensating action), the pivot (once it commits, the saga must run forward), and retriable (post-pivot steps designed so retries eventually succeed).
  • Coordinate by choreography (services react to each other’s events; flows get hard to trace) or orchestration (a coordinator sends commands, tracks state, triggers compensation).
  • Persist progress durably before acting; a crashed coordinator then resumes or compensates.
  • Make all steps and compensations idempotent (brokers redeliver), and publish events atomically with the local commit via an outbox or change data capture.
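
A rough sketch of the orchestration variant, assuming an in-memory dictionary stands in for a durable journal table. All names here (`Orchestrator`, `Participant`, `Crash`, the key format) are hypothetical. It shows two of the points above: intent is journaled before a command is sent, and a participant deduplicates on an idempotency key, so a restarted coordinator can safely re-issue the step it was in the middle of. The outbox is not modelled.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple


class Crash(Exception):
    """Simulates the coordinator process dying mid-saga."""


class Participant:
    """Hypothetical service that applies each command at most once per key."""

    def __init__(self) -> None:
        self.seen: Set[str] = set()
        self.effects: List[str] = []

    def handle(self, key: str, effect: str) -> None:
        if key in self.seen:  # redelivered command: ignore
            return
        self.seen.add(key)
        self.effects.append(effect)


# (step name, forward command, compensating command); both take an idempotency key
Step = Tuple[str, Callable[[str], None], Callable[[str], None]]


class Orchestrator:
    def __init__(self, saga_id: str, steps: List[Step],
                 store: Dict[str, List[str]]) -> None:
        self.saga_id = saga_id
        self.steps = steps
        self.journal = store.setdefault(saga_id, [])  # stand-in for a durable log

    def run(self, crash_after: Optional[str] = None) -> str:
        for name, forward, _ in self.steps:
            if f"done:{name}" in self.journal:
                continue  # finished before an earlier crash
            if f"begin:{name}" not in self.journal:
                self.journal.append(f"begin:{name}")  # persist intent before acting
            try:
                forward(f"{self.saga_id}:{name}")
            except Exception:
                return self._compensate()
            if crash_after == name:
                raise Crash(name)  # dies before recording completion
            self.journal.append(f"done:{name}")
        self.journal.append("completed")
        return "completed"

    def _compensate(self) -> str:
        for name, _, undo in reversed(self.steps):
            done = f"done:{name}" in self.journal
            if done and f"undone:{name}" not in self.journal:
                undo(f"{self.saga_id}:{name}:undo")
                self.journal.append(f"undone:{name}")
        self.journal.append("compensated")
        return "compensated"


store: Dict[str, List[str]] = {}
svc = Participant()
steps: List[Step] = [
    ("reserve", lambda k: svc.handle(k, "reserved"), lambda k: svc.handle(k, "released")),
    ("charge", lambda k: svc.handle(k, "charged"), lambda k: svc.handle(k, "refunded")),
]
try:
    Orchestrator("saga-1", steps, store).run(crash_after="charge")
except Crash:
    pass  # the coordinator died after sending "charge" but before journaling it
result = Orchestrator("saga-1", steps, store).run()  # a fresh coordinator resumes
```

The second coordinator finds `begin:charge` without `done:charge`, re-sends the command under the same key, and the participant ignores the duplicate. In a real system the journal write and the command send would each need their own durability guarantees.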

Failure Modes

  • Compensation failure — a compensating action itself fails; the coordinator retries it, and a permanent failure pages an operator.
  • Lack of isolation — concurrent transactions see intermediate state that compensation later reverts; counter with semantic locks, commutative updates, a pessimistic view (reordering steps to limit exposure to dirty reads), or re-reading a value before updating it.
  • Sequencing errors — an irreversible action (a payout, an email) placed before the pivot cannot be retracted; make it the pivot or a retriable step.
  • Lost publish — a participant commits but crashes before publishing; without the outbox the saga stalls silently.
  • Timeout race — a timed-out step succeeds late after compensation: a charge landing after its refund.
  • Stalled sagas — a saga that never reaches a terminal state sits in limbo; per-step timeouts and an alert on the age of the oldest incomplete saga bound how long it can stay there.
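
One possible guard against the timeout race, sketched under assumptions: the participant keeps per-saga state, and a compensation that arrives before its forward command leaves a tombstone that blocks the late command. The `PaymentService` class and its state names are hypothetical; the text above names the failure mode but does not prescribe this particular fix.

```python
from typing import Dict, List


class PaymentService:
    """Hypothetical participant that tolerates duplicate and out-of-order delivery."""

    def __init__(self) -> None:
        # saga_id -> "charged" | "refunded" | "cancelled"
        self.state: Dict[str, str] = {}
        self.ledger: List[str] = []

    def charge(self, saga_id: str) -> bool:
        """Apply the charge once; return True only if the charge is in effect."""
        if saga_id in self.state:
            # Duplicate delivery, or a compensation already arrived.
            return self.state[saga_id] == "charged"
        self.state[saga_id] = "charged"
        self.ledger.append(f"charge:{saga_id}")
        return True

    def compensate(self, saga_id: str) -> None:
        current = self.state.get(saga_id)
        if current == "charged":
            self.state[saga_id] = "refunded"
            self.ledger.append(f"refund:{saga_id}")
        elif current is None:
            # Tombstone: nothing to refund yet, but a late charge must not land.
            self.state[saga_id] = "cancelled"
        # "refunded" or "cancelled": already handled, so this is a no-op.


def late_charge_after_compensation() -> PaymentService:
    svc = PaymentService()
    svc.compensate("s1")  # compensation arrives first (the step timed out)
    svc.charge("s1")      # the delayed charge arrives afterwards and is rejected
    return svc
```

With the tombstone in place the late charge never reaches the ledger, so there is no charge left dangling after its refund. The trade-off is that tombstones must be retained for at least as long as a delayed message can still arrive.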

Verification

  • Inject failures at every step; the saga reaches completed or compensated within the per-flow SLA.
  • Replay every message; side effects occur exactly once.
  • Deliver a compensating event before its forward event; state stays correct.
  • Cross-service reconciliation reports zero unexplained discrepancies.
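
The first check in the list can be approximated with a small fault-injection harness. This is a toy sketch with hypothetical names (`run_saga`, `verify`, `STEPS`): steps are modelled as pairs of effect labels, a failure is injected at each position in turn, and the harness asserts that the saga ends in a terminal state with every forward effect either kept or matched by its compensation. It does not measure the SLA.

```python
from typing import List, Optional, Tuple

Step = Tuple[str, str]  # (forward effect, compensating effect)

STEPS: List[Step] = [("reserve", "release"), ("charge", "refund"), ("ship", "cancel")]


def run_saga(steps: List[Step], fail_at: Optional[int]) -> Tuple[str, List[str]]:
    """Run the saga, failing at index fail_at (None means no injected failure)."""
    log: List[str] = []
    undo_stack: List[str] = []
    for i, (forward, undo) in enumerate(steps):
        if i == fail_at:
            # Injected failure: compensate completed steps in reverse order.
            log.extend(reversed(undo_stack))
            return "compensated", log
        log.append(forward)
        undo_stack.append(undo)
    return "completed", log


def verify(steps: List[Step]) -> List[str]:
    """Inject a failure at every step and check the terminal-state invariants."""
    outcomes: List[str] = []
    for fail_at in [None, *range(len(steps))]:
        outcome, log = run_saga(steps, fail_at)
        assert outcome in ("completed", "compensated")
        for forward, undo in steps:
            if outcome == "completed":
                assert log.count(forward) == 1 and log.count(undo) == 0
            else:
                # Every applied effect is undone exactly once; unapplied ones never are.
                assert log.count(forward) == log.count(undo) <= 1
        outcomes.append(outcome)
    return outcomes
```

Against real services the same loop would drive a fault-injection hook per step and compare datastore state rather than an in-memory log.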

Related Patterns

  • Asynchronous Messaging — supplies delivery, retry, and queueing for both coordination styles.
  • Two-Phase Commit — holds locks until the coordinated commit, blocks in-doubt participants, scales poorly.

References