What is zero-downtime deployment and how do you achieve it?
Why Interviewers Ask This
This is a differentiating question used for senior and lead roles. Interviewers want to see if you can explain not just what happens, but why — and what the trade-offs are in different approaches.
Answer
Zero-downtime deployment updates a running system without users experiencing an outage or errors. It is critical for systems with strict SLAs. Strategies:
(1) Rolling update: replace instances one at a time (or in batches). At any point, some instances run the old version and some run the new one. This is the default strategy for Kubernetes Deployments. It is simple, but old and new code serve traffic simultaneously during the rollout, so the two versions must be compatible with each other. If the new version is buggy, roll back by redeploying the old version.
(2) Blue-green deployment: maintain two identical environments. Blue = current production, Green = new version. Deploy to Green, run tests, then switch the load balancer to point to Green. Instant rollback = flip the load balancer back to Blue. Cost: you need 2x infrastructure during the deployment. Database migrations must be backward compatible, because Blue must still work with the migrated schema.
(3) Canary deployment: route a small percentage of traffic (1%, 5%) to the new version and monitor errors and latency. If the canary is healthy, gradually increase the percentage. Roll back by routing all traffic back to the old version. Advanced: route only certain user segments (internal users, a beta group) to the canary.
(4) Feature flags: deploy the code with the feature disabled, then enable the flag progressively for user segments. The new code is deployed but not activated until it is ready.
Database migrations for zero downtime use the expand-contract (backward-compatible) pattern. Phase 1: add the new column as nullable and deploy code that writes to both the old and the new column. Phase 2: backfill existing rows into the new column. Phase 3: deploy code that reads from the new column. Phase 4: once nothing reads the old column, stop writing to it and drop it. Never make a breaking schema change while old code is still running.
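The canary strategy above can be sketched in a few lines. This is a minimal illustration, not how any particular load balancer or service mesh implements it: the names bucket_for, choose_version, and forced_canary_users are hypothetical, and the sketch assumes each request carries a stable user id. Hashing the id gives every user a fixed bucket, so a user who lands on the canary at 5% stays on it when the percentage is raised.

```python
import hashlib


def bucket_for(user_id: str) -> int:
    """Map a user id to a stable bucket in the range [0, 100)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def choose_version(user_id: str, canary_percent: int,
                   forced_canary_users: frozenset = frozenset()) -> str:
    """Return "canary" or "stable" for one request.

    canary_percent: share of users (0-100) routed to the new version.
    forced_canary_users: hypothetical segment (internal users, beta group)
    that always gets the new version, regardless of the percentage.
    """
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    if user_id in forced_canary_users:
        return "canary"
    # Buckets below the threshold go to the canary; the rest stay on stable.
    return "canary" if bucket_for(user_id) < canary_percent else "stable"


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    for percent in (1, 5, 25, 100):
        on_canary = sum(choose_version(u, percent) == "canary" for u in users)
        print(f"{percent}% target: {on_canary} of {len(users)} users on canary")
```

Rolling back in this sketch means setting canary_percent to 0 and passing an empty forced_canary_users set, which sends every request to the stable version. In practice the percentage would come from configuration that can be changed without a redeploy.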
Pro Tip
Back up your answer with a specific project or situation. Saying 'In my last System Design project, I used this when...' immediately makes your answer more credible and memorable.