How would you design a global distributed cache system?
Why Interviewers Ask This
Senior engineers are expected to reason about architecture, performance, and edge cases in system design. This question separates mid-level from senior candidates by testing whether they understand how the pieces behave as a system, not just what each piece is.
Answer
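A global distributed cache provides low-latency data access across multiple geographic regions. A reasonable starting point is a multi-region deployment of Redis clusters.
Topology options: (1) Regional caches: independent Redis clusters in each region (US, EU, APAC). Applications read from the cache in their own region, so a cache hit involves no cross-region round trip. Invalidations propagate across regions asynchronously. Risk: brief inconsistency between regions. (2) Hierarchical cache: L1 in-process cache (per app instance) → L2 regional Redis → L3 origin database. L1 has the lowest latency (microseconds) but is small and is not shared between instances; the origin database is the slowest tier but is authoritative. The two options combine naturally: the hierarchy describes the read path inside each region.
Cache invalidation across regions: when data changes, publish an invalidation event to Kafka; consumers in each region invalidate the local cache entries. This is eventually consistent: there is a brief window in which a region can serve stale data. To narrow that window, use cache-aside with short TTLs and version validation (store a version with each entry and discard entries older than the latest known version). Short TTLs only bound staleness; reads that need strong consistency have to go to the authoritative database.
Partitioning: within a region, spread keys across nodes so that a topology change remaps as few keys as possible. Client-side sharding typically uses consistent hashing with vnodes. Redis Cluster instead uses 16384 fixed hash slots: each key maps to a slot (CRC16 of the key modulo 16384), slots are assigned to nodes, and resharding moves slots between nodes.
Replication within region: each Redis primary has replicas. The primary handles writes; replicas can serve reads (in Redis Cluster a client must opt in with READONLY, and because replication is asynchronous those reads may be slightly stale). On primary failure, a replica is promoted.
Data eviction: LRU (Least Recently Used) for general caches; LFU (Least Frequently Used) when a small set of keys is accessed far more often than the rest; TTL-based expiry for time-sensitive data.
Write strategies per region: write-through (write to the cache and the DB synchronously) or write-behind (write to the cache, then flush to the DB asynchronously; this lowers write latency, but writes not yet flushed are lost if the cache node fails).
Monitoring: cache hit rate, eviction rate, memory usage, and replication lag per region.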
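The hierarchical read path (L1 → L2 → origin) can be sketched as below. This is a minimal illustration, not a production design: `TTLCache` and `TieredCache` are hypothetical names, the L2 tier is simulated with an in-memory object rather than a real Redis client, and `load_from_origin` stands in for a database query.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """Tiny in-memory cache with per-entry expiry (stand-in for a cache tier)."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._data: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key: str, value: str) -> None:
        self._data[key] = (self._clock() + self._ttl, value)

    def delete(self, key: str) -> None:
        self._data.pop(key, None)


class TieredCache:
    """Cache-aside read path: L1 (in-process) -> L2 (regional) -> origin."""

    def __init__(self, l1: TTLCache, l2: TTLCache,
                 load_from_origin: Callable[[str], str]) -> None:
        self.l1 = l1
        self.l2 = l2
        self.load_from_origin = load_from_origin  # assumed to hit the database

    def get(self, key: str) -> str:
        value = self.l1.get(key)
        if value is not None:
            return value  # fastest path: no network hop
        value = self.l2.get(key)
        if value is None:
            value = self.load_from_origin(key)  # authoritative, slowest tier
            self.l2.set(key, value)
        self.l1.set(key, value)  # populate L1 for subsequent reads
        return value

    def invalidate(self, key: str) -> None:
        # Clears L2 and this process's L1 only; other app instances
        # keep their own L1 copy until its TTL expires.
        self.l2.delete(key)
        self.l1.delete(key)
```

Giving L1 a much shorter TTL than L2 is one way to bound how long an instance can serve a value that was already invalidated in the shared tier.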
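Version validation on invalidation events might look like the following sketch. It assumes each write to the database produces a monotonically increasing version per key and that the event carries that version; the class name, method names, and event shape are hypothetical, and the Kafka consumer loop itself is omitted.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Entry:
    version: int
    value: str


class VersionedRegionalCache:
    """Regional cache that tolerates duplicate and out-of-order invalidations."""

    def __init__(self) -> None:
        self._entries: Dict[str, Entry] = {}
        # Lowest version still acceptable per key, learned from events.
        self._min_version: Dict[str, int] = {}

    def put(self, key: str, version: int, value: str) -> bool:
        """Fill the cache; returns False if the value is known to be stale."""
        if version < self._min_version.get(key, 0):
            return False  # a newer write was already announced
        current = self._entries.get(key)
        if current is not None and current.version > version:
            return False  # never overwrite a newer entry with an older one
        self._entries[key] = Entry(version, value)
        return True

    def get(self, key: str) -> Optional[Entry]:
        return self._entries.get(key)

    def on_invalidation(self, key: str, new_version: int) -> None:
        """Apply one invalidation event (assumed payload: key + new version)."""
        if new_version > self._min_version.get(key, 0):
            self._min_version[key] = new_version
        current = self._entries.get(key)
        if current is not None and current.version < new_version:
            del self._entries[key]  # cached value predates the write
```

Tracking the minimum acceptable version closes a common race: a slow reader that fetched the old value before the write cannot re-insert it after the invalidation has been applied. In this sketch `_min_version` grows without bound; a real system would expire those records.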
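To make the hash-slot mapping concrete, here is a small sketch of how a key is mapped to one of the 16384 slots, following the Redis Cluster specification (CRC16, XMODEM variant, modulo 16384, with hash tags in braces). The function name `key_slot` is illustrative; Python's standard `binascii.crc_hqx` with an initial value of 0 computes the same CRC16 variant.

```python
import binascii

HASH_SLOTS = 16384


def key_slot(key: str) -> int:
    """Return the Redis Cluster hash slot for a key."""
    data = key.encode("utf-8")
    # Hash tags: if the key contains "{...}" with at least one character
    # between the first "{" and the first "}" after it, only that
    # substring is hashed, so related keys can share a slot.
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        if end != -1 and end > start + 1:
            data = data[start + 1:end]
    return binascii.crc_hqx(data, 0) % HASH_SLOTS


if __name__ == "__main__":
    # Keys sharing a hash tag land in the same slot, which is what
    # multi-key operations in a cluster require.
    print(key_slot("{user1000}.following") == key_slot("{user1000}.followers"))
```

Because the slot count is fixed, adding a node means moving some slots (and their keys) to it; keys in slots that do not move are unaffected.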
Common Mistake
A common mistake is memorizing definitions without understanding their implications. When asked this question, go one level deeper: explain what happens when a design choice is misused or ignored, for example what a region serves if an invalidation event is delayed or lost.