How would you design a distributed lock service?
Why Interviewers Ask This
Senior engineers are expected to reason about architecture, performance, and edge cases. This question separates mid-level from senior candidates by testing whether they can reason about failure modes such as crashed lock holders, clock skew, and process pauses.
Answer
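A distributed lock service provides mutual exclusion across processes on different machines. It is critical for leader election, distributed cron jobs, and resource access control.
Requirements: mutual exclusion (at most one holder at a time), deadlock freedom (locks auto-expire if the holder crashes), fault tolerance (the lock service itself must be highly available), and low latency.
Redis-based lock (single instance): acquire with SET key <unique_value> NX EX 30 (set only if the key does not exist, expire in 30 s). Plain SETNX cannot set an expiry in the same command, so the SET ... NX EX form is what makes acquisition atomic. The value should be unique per client so that release can verify ownership. Release: check that the stored value matches the caller's value, then DEL, with both steps in one Lua script for atomicity. Problem: if the Redis instance crashes, lock state is lost.
Redlock (multi-node): acquire the lock on N independent Redis instances (N=5) and succeed only if a majority (3 of 5) is acquired within the timeout. This provides better fault tolerance than a single instance. Controversy: Martin Kleppmann argued that Redlock is unsafe when correctness depends on strong mutual exclusion, because it relies on timing assumptions that clock skew and GC pauses can violate. His recommendation is fencing tokens: the lock service issues a monotonically increasing token with each grant, and the protected resource rejects any request carrying a token lower than the highest it has already seen.
ZooKeeper-based lock: each client creates an ephemeral sequential znode under /locks/my-lock. The client whose node has the lowest sequence number holds the lock. Every other client watches the node immediately preceding its own. When that predecessor is deleted (the holder released the lock, or its session ended and the ephemeral node was removed), the next waiting client is notified and checks whether it now has the lowest number. ZooKeeper replicates state with the ZAB (ZooKeeper Atomic Broadcast) consensus protocol, not Raft, so lock state survives the failure of a minority of servers. This gives much stronger guarantees than a single Redis instance, but it is not absolute: a client that pauses past its session timeout can still believe it holds the lock, so fencing tokens remain advisable.
etcd-based lock: similar to ZooKeeper, but built on leases. The client creates the lock key, attached to a lease, inside a transaction that succeeds only if the key does not already exist; this transaction is etcd's compare-and-swap. The lease expires if the client crashes and stops sending keep-alives, which deletes the key. etcd replicates state with Raft.
Design principles: use TTL or lease expiration (prevents locks from being held forever by dead clients), use fencing tokens so a stale holder cannot act on the resource even if it still thinks it holds the lock, and keep critical sections short.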
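The three mechanisms above (lease expiry, owner-checked release, and fencing tokens) can be illustrated with a small in-memory sketch. This is a toy model under stated assumptions, not a production design: the names LockService, FencedStore, and demo are hypothetical, the lock table lives in one process rather than being replicated, and the current time is passed in explicitly as `now` so the behavior is deterministic.

```python
import itertools
from dataclasses import dataclass


@dataclass
class Lease:
    owner: str
    token: int
    expires_at: float


class LockService:
    """Toy in-memory lock table (hypothetical; a real service replicates this state)."""

    def __init__(self):
        self._leases = {}                  # lock name -> Lease
        self._tokens = itertools.count(1)  # monotonically increasing fencing tokens

    def acquire(self, name, owner, ttl, now):
        """Return a fencing token, or None if another owner holds a live lease."""
        lease = self._leases.get(name)
        if lease is not None and lease.expires_at > now:
            return None
        token = next(self._tokens)
        self._leases[name] = Lease(owner, token, now + ttl)
        return token

    def release(self, name, owner, token):
        """Delete the lease only if the caller still owns it (compare, then delete)."""
        lease = self._leases.get(name)
        if lease is None or lease.owner != owner or lease.token != token:
            return False
        del self._leases[name]
        return True


class FencedStore:
    """Protected resource that rejects writes carrying an older token."""

    def __init__(self):
        self.highest_token = 0
        self.value = None

    def write(self, token, value):
        if token < self.highest_token:
            return False
        self.highest_token = token
        self.value = value
        return True


def demo():
    locks, store = LockService(), FencedStore()
    t_a = locks.acquire("job", "client-a", ttl=30, now=0)       # A gets the lock
    blocked = locks.acquire("job", "client-b", ttl=30, now=10)  # A's lease is still live
    t_b = locks.acquire("job", "client-b", ttl=30, now=31)      # A's lease has expired
    ok_b = store.write(t_b, "from-b")
    ok_a = store.write(t_a, "from-a")  # A resumes after a pause with a stale token
    stale_release = locks.release("job", "client-a", t_a)       # A no longer owns the lock
    return t_a, blocked, t_b, ok_b, ok_a, stale_release, store.value
```

In this sketch client A never learns that its lease expired, yet its late write is still rejected because the store has already seen a higher token. That is the reason fencing is enforced at the resource rather than at the lock service.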
Pro Tip
Before answering, structure your response: one-line definition → real-world analogy → concrete example from a project. This makes even complex System Design answers easy to follow.
More System Design Questions
- Advanced How would you design a distributed file system like HDFS?
- Advanced How would you design a video streaming service like Netflix?
- Advanced What is the consistent hashing with virtual nodes in detail?
- Advanced How would you design a global distributed database like Google Spanner?
- Advanced What is the difference between optimistic and pessimistic locking in distributed systems?