What is database sharding?

Q: What is database sharding?

Answer

Database sharding is a horizontal partitioning strategy that distributes data across multiple database instances (shards). Each shard is an independent database that stores only its own subset of the total data.

Sharding strategies:

(1) Range-based: shard by value range (user IDs 1 to 1,000,000 → shard 1; 1,000,001 to 2,000,000 → shard 2). Routing is simple, but skewed or sequential keys can create hot spots.
(2) Hash-based: apply a hash function to the shard key (hash(user_id) % num_shards). Data is distributed evenly, but range queries must fan out to all shards.
(3) Directory-based: a lookup table maps records to shards. This is flexible, but the lookup table can become a bottleneck and a single point of failure.
(4) Geo-based: shard by geography (EU users on an EU shard, US users on a US shard), which helps with data sovereignty compliance.

Benefits: each shard handles less data and load, enabling horizontal scaling beyond a single machine's limits.

Challenges: cross-shard joins are complex or impossible; cross-shard transactions require distributed protocols such as two-phase commit; resharding when adding nodes is complex, because with modulo hashing a change in num_shards remaps most keys (consistent hashing reduces the data movement); and shard key selection is critical, since a poorly chosen key causes hot spots.

Choosing the shard key: look for high cardinality (many distinct values), uniform distribution, and alignment with query patterns.

Tools: Vitess (MySQL sharding), Citus (PostgreSQL sharding). MongoDB and Cassandra have built-in sharding.
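
The first three routing strategies can be sketched in a few lines of Python. This is a minimal illustration, not how any particular database implements sharding: the shard names, the range boundaries, the tenant directory, and every function name below are hypothetical. The last helper estimates how many keys change shards when a modulo-hashed cluster grows, which is the resharding cost mentioned above.

```python
import bisect
import hashlib

# Hypothetical shard names, for illustration only.
SHARDS = ["shard-1", "shard-2", "shard-3", "shard-4"]

# (1) Range-based: inclusive upper bound of the ID range owned by each shard.
RANGE_UPPER_BOUNDS = [1_000_000, 2_000_000, 3_000_000, 4_000_000]


def range_shard(user_id: int) -> str:
    """Route by value range: IDs 1..1,000,000 go to shard-1, and so on."""
    if user_id < 1 or user_id > RANGE_UPPER_BOUNDS[-1]:
        raise ValueError(f"user_id {user_id} is outside every configured range")
    return SHARDS[bisect.bisect_left(RANGE_UPPER_BOUNDS, user_id)]


def stable_hash(key) -> int:
    """Deterministic hash of a key.

    Python's built-in hash() is salted per process for strings, so a digest
    is used instead to keep routing stable across restarts.
    """
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def hash_index(key, num_shards: int) -> int:
    """(2) Hash-based: hash(key) % num_shards gives the shard index."""
    return stable_hash(key) % num_shards


def hash_shard(key) -> str:
    return SHARDS[hash_index(key, len(SHARDS))]


# (3) Directory-based: an explicit lookup table (hypothetical tenants).
DIRECTORY = {"tenant-a": "shard-2", "tenant-b": "shard-4"}


def directory_shard(tenant_id: str) -> str:
    return DIRECTORY[tenant_id]


def moved_fraction(keys, old_n: int, new_n: int) -> float:
    """Fraction of keys whose shard index changes when num_shards changes."""
    keys = list(keys)
    moved = sum(1 for k in keys if hash_index(k, old_n) != hash_index(k, new_n))
    return moved / len(keys)


if __name__ == "__main__":
    print(range_shard(1_500_000))
    print(hash_shard(42))
    print(directory_shard("tenant-a"))
    print(moved_fraction(range(10_000), 4, 5))
```

With a uniform hash, a key keeps its shard when growing from 4 to 5 shards only if its hash gives the same remainder modulo both 4 and 5, which happens for 4 out of every 20 hash values. So roughly 80% of keys would be expected to move, which is why consistent hashing or a directory is often preferred when the shard count is likely to change.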
