What is the Saga pattern for distributed transactions?
Why Interviewers Ask This
Mid-level System Design roles require a solid understanding of how consistency is maintained across services. Interviewers ask this to separate candidates who understand the mechanics (local transactions, compensation, failure handling) from those who know only surface-level concepts.
Answer
The Saga pattern maintains data consistency across multiple services in a microservices architecture without a distributed transaction protocol such as two-phase commit (2PC). A Saga is a sequence of local transactions: each step commits in its own service and publishes an event or message that triggers the next step, and each step has a compensating transaction that undoes its effects if a later step fails.

Types:

(1) Choreography: each service publishes events and reacts to events from other services autonomously, with no central coordinator. Pro: simple, loose coupling. Con: the overall flow is hard to follow, and there is a risk of cyclic dependencies between services. Example: Order service publishes OrderCreated → Payment service processes payment, publishes PaymentProcessed → Inventory service reserves items, publishes InventoryReserved → Shipping service schedules delivery. On failure: each service listens for failure events and triggers its own compensation (for example, PaymentFailed → Order service marks the order cancelled).

(2) Orchestration: a central Saga orchestrator explicitly tells each service what to do and reacts to the replies. Pro: centralized flow, easy to understand. Con: the orchestrator risks becoming a god object, and services are more tightly coupled to it. The orchestrator runs as a separate service or workflow.

Compensating transactions: each one undoes a step that completed successfully, for example "cancel reservation" or "refund payment." They must be idempotent, because a compensation may be retried or delivered more than once.

Key insight: Sagas provide eventual consistency, not ACID guarantees. There is no isolation between Sagas, so the system may briefly be in an inconsistent state while one is in progress. Design for "compensatable" transactions: not all operations can be undone (e.g., an email that has been sent cannot be unsent), so such steps are best placed after the steps that can still fail.

Tools: AWS Step Functions, Temporal, Conductor, Axon Framework.
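The orchestration variant and its compensation logic can be sketched in a few lines of Python. This is a minimal in-memory illustration, not a production design: the names `SagaStep`, `SagaOrchestrator`, and `run_order_saga` are hypothetical, the "services" are plain functions mutating a dictionary, and it assumes compensations themselves never fail (a real orchestrator would persist its progress and retry them).

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class SagaStep:
    name: str
    action: Callable[[], None]        # the local transaction
    compensation: Callable[[], None]  # undoes the action; must be idempotent


@dataclass
class SagaOrchestrator:
    steps: List[SagaStep]
    log: List[str] = field(default_factory=list)

    def run(self) -> bool:
        completed: List[SagaStep] = []
        for step in self.steps:
            try:
                step.action()
            except Exception:
                self.log.append(f"failed:{step.name}")
                # Undo the steps that already committed, newest first.
                for done in reversed(completed):
                    done.compensation()
                    self.log.append(f"compensated:{done.name}")
                return False
            completed.append(step)
            self.log.append(f"done:{step.name}")
        return True


def run_order_saga(order_id: str, stock: int) -> Tuple[bool, List[str], Dict[str, Any]]:
    # Hypothetical shared state standing in for three separate services.
    state: Dict[str, Any] = {
        "order_id": order_id, "order": "pending",
        "charged": False, "refunds": 0,
        "stock": stock, "reserved": False,
    }

    def create_order() -> None:
        state["order"] = "created"

    def cancel_order() -> None:
        state["order"] = "cancelled"  # setting a fixed value is idempotent

    def charge_payment() -> None:
        state["charged"] = True

    def refund_payment() -> None:
        if not state["charged"]:      # already refunded: do nothing
            return
        state["charged"] = False
        state["refunds"] += 1

    def reserve_inventory() -> None:
        if state["stock"] < 1:
            raise RuntimeError("out of stock")
        state["stock"] -= 1
        state["reserved"] = True

    def release_inventory() -> None:
        if not state["reserved"]:     # already released: do nothing
            return
        state["reserved"] = False
        state["stock"] += 1

    saga = SagaOrchestrator([
        SagaStep("create_order", create_order, cancel_order),
        SagaStep("charge_payment", charge_payment, refund_payment),
        SagaStep("reserve_inventory", reserve_inventory, release_inventory),
    ])
    ok = saga.run()
    return ok, saga.log, state


if __name__ == "__main__":
    ok, log, state = run_order_saga("order-1", stock=0)
    print(ok, log, state["order"])
```

Calling `run_order_saga("order-1", stock=0)` makes the inventory step fail, so the orchestrator refunds the payment and then cancels the order, in reverse order of completion. The guard clauses in `refund_payment` and `release_inventory` are what make a repeated compensation harmless.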
Pro Tip
If you're unsure about a detail, say so honestly and explain your reasoning. Interviewers respect candidates who can think through uncertainty rather than bluffing.