What is database sharding?
Why Interviewers Ask This
This is a classic screening question for system design roles. Hiring managers ask it early in interviews to gauge your baseline understanding and to see whether you can communicate technical concepts clearly.
Answer
Database sharding is a horizontal partitioning strategy that distributes data across multiple database instances (shards). Each shard is an independent database that stores only its own subset of the total data.

Sharding strategies:
(1) Range-based: shard by value range (user IDs 1 to 1,000,000 → shard 1; 1,000,001 to 2,000,000 → shard 2). Routing is simple, but sequential or skewed keys can create hot spots.
(2) Hash-based: apply a hash function to the shard key (hash(user_id) % num_shards). Data is distributed evenly, but range queries on the shard key must fan out to all shards.
(3) Directory-based: a lookup table maps records to shards. This is flexible, but the lookup table can become a bottleneck and a single point of failure.
(4) Geo-based: shard by geography (EU users on an EU shard, US users on a US shard), which helps with data sovereignty compliance.

Benefits: each shard handles less data and load, enabling horizontal scaling beyond one machine's limits.

Challenges: cross-shard joins are complex or impossible; cross-shard transactions require distributed protocols such as two-phase commit; resharding when adding nodes is complex (consistent hashing helps by limiting how many keys have to move); shard key selection is critical, because a poorly chosen key causes hot spots.

Choosing the shard key: look for high cardinality (many distinct values), uniform distribution of values and traffic, and alignment with query patterns, so that most queries can be served by a single shard.

Tools: Vitess (MySQL sharding), Citus (PostgreSQL sharding). MongoDB and Cassandra have built-in sharding.
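To make the routing strategies concrete in an interview, a small sketch can help. The Python below is a minimal illustration, not how any particular database implements sharding. The names NUM_SHARDS, RANGE_UPPER_BOUNDS, DIRECTORY and the tenant names are hypothetical, and shard numbers are zero-based indices. It shows range, hash, and directory routing, plus a helper that measures how many keys change shards when the shard count changes under plain modulo hashing.

```python
import bisect
import hashlib

NUM_SHARDS = 4  # hypothetical cluster size


def stable_hash(key: str) -> int:
    """Deterministic hash. Python's built-in hash() is salted per process
    for strings, so it is unsuitable for routing across processes."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def hash_shard(user_id: int, num_shards: int = NUM_SHARDS) -> int:
    """Hash-based routing: hash(user_id) % num_shards."""
    return stable_hash(str(user_id)) % num_shards


# Range-based routing: inclusive upper bound of each shard's user ID range.
# Index 0 covers IDs 1..1,000,000, index 1 covers 1,000,001..2,000,000, etc.
RANGE_UPPER_BOUNDS = [1_000_000, 2_000_000, 3_000_000, 4_000_000]


def range_shard(user_id: int) -> int:
    """Return the index of the first range whose upper bound >= user_id."""
    if not 1 <= user_id <= RANGE_UPPER_BOUNDS[-1]:
        raise ValueError(f"user_id {user_id} is outside all configured ranges")
    return bisect.bisect_left(RANGE_UPPER_BOUNDS, user_id)


# Directory-based routing: an explicit lookup table (hypothetical tenants).
DIRECTORY = {"acme": 2, "globex": 0}


def directory_shard(tenant: str) -> int:
    """Look the shard up directly; raises KeyError for unknown tenants."""
    return DIRECTORY[tenant]


def moved_fraction(keys, old_n: int, new_n: int) -> float:
    """Fraction of keys whose shard changes when the shard count changes
    under modulo hashing."""
    keys = list(keys)
    moved = sum(1 for k in keys if hash_shard(k, old_n) != hash_shard(k, new_n))
    return moved / len(keys)


if __name__ == "__main__":
    print("range  :", range_shard(1_500_000))
    print("hash   :", hash_shard(1_500_000))
    print("dir    :", directory_shard("acme"))
    print("moved 4 -> 5 shards:", moved_fraction(range(1, 10_001), 4, 5))
```

Going from 4 to 5 shards with modulo hashing moves about 80% of keys in expectation, since a key stays put only when its hash gives the same remainder for both divisors. That is the resharding cost that consistent hashing is meant to reduce.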
Pro Tip
If you're unsure about a detail, say so honestly and explain your reasoning. Interviewers respect candidates who can think through uncertainty rather than bluffing.