Database sharding horizontally partitions one logical database into multiple shards, each responsible for a subset of rows. Requests are routed by shard key so that most reads and writes hit one shard, not the whole fleet. Sharding is primarily a scale-out technique for data size and write throughput.
It is not a default optimization. A well-indexed single cluster, read replicas, archival, or CQRS are usually cheaper to operate. The shard key is the make-or-break design decision: a poor key creates hot shards, scatter-gather queries, and painful resharding.
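The routing idea above can be sketched in a few lines. This is a minimal illustration, not a production router: the shard count and key names are assumptions, and a stable hash is used so every node routes the same key identically.

```python
import hashlib

NUM_SHARDS = 8  # assumed fixed shard count for illustration

def shard_for(tenant_id: str) -> int:
    """Map a shard key to a shard index with a stable hash.

    hashlib is used instead of Python's built-in hash() because the
    built-in is salted per process and would route the same key to
    different shards on different nodes.
    """
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# A request carrying the shard key resolves to exactly one shard,
# and the mapping is deterministic across processes and restarts.
print(shard_for("tenant-42") == shard_for("tenant-42"))  # True
```

Because the mapping is pure arithmetic over the key, any stateless router replica can compute it without coordination; only resharding changes the key-to-shard assignment.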
How It Works
- Choose a shard key that appears in the dominant read and write paths, for example tenant_id, customer_id, or an order-range identifier.
- Partition rows by range, hash, or directory lookup so each shard owns a non-overlapping subset of the key space.
- Route requests through a router or metadata service that maps the shard key to the correct shard; queries without the key usually fan out to multiple shards.
- Keep data that must be queried or updated together on the same shard whenever possible; sharding and replication are orthogonal, so each shard typically has its own primary and replicas.
- Plan online resharding from day one: copy data to new shards, keep source and target in sync, verify counts and checksums, then cut over routing with a bounded window.
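The three partitioning schemes above (range, hash, directory) can be sketched side by side. Boundaries, shard counts, and the directory contents here are illustrative assumptions, not a real deployment.

```python
import bisect
import hashlib

# Range partitioning: each shard owns a contiguous key interval.
# Shard 0 owns keys < "g", shard 1 owns ["g", "n"), and so on.
RANGE_BOUNDS = ["g", "n", "t"]

def range_shard(key: str) -> int:
    return bisect.bisect_right(RANGE_BOUNDS, key)

# Hash partitioning: a stable hash spreads keys roughly uniformly.
def hash_shard(key: str, num_shards: int = 4) -> int:
    h = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(h[:8], "big") % num_shards

# Directory lookup: an explicit key-to-shard map. In practice this
# lives in a metadata service and is updated during resharding.
DIRECTORY = {"tenant-1": 0, "tenant-2": 3}

def directory_shard(key: str) -> int:
    return DIRECTORY[key]
```

Range partitioning preserves key order (good for range scans, prone to hotspots on monotonic keys), hashing balances load but scatters ranges, and a directory trades an extra lookup for full placement flexibility.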
Failure Modes
- Hot shard: the chosen key is monotonic or heavily skewed, so one shard receives disproportionate write load or storage growth.
- Scatter-gather queries dominate because the application often queries by fields that are not the shard key or its prefix.
- Cross-shard joins, transactions, or global unique constraints become common, eliminating the expected scale benefit and increasing coordination cost.
- Resharding is treated as an exceptional project instead of a rehearsed operation, leading to long cutovers, drift, or data loss risk.
- Large tenants or key ranges outgrow the original distribution model, forcing emergency repartitioning under production load.
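The hot-shard failure mode above can be demonstrated with synthetic data: a monotonically increasing id under static range partitioning pins all recent writes to the last shard, while hashing the same keys spreads them out. The key format and boundaries are invented for the demo.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 4
keys = [f"order-{i:08d}" for i in range(10_000)]  # monotonically increasing ids
recent = keys[-1000:]                             # the current write window

# Range partitioning with static boundaries: every new id falls past
# the last boundary, so the final shard absorbs all recent writes.
def range_shard(key: str) -> int:
    n = int(key.split("-")[1])
    return min(n // 2_500, NUM_SHARDS - 1)

# Hash partitioning spreads the same monotonic keys roughly uniformly.
def hash_shard(key: str) -> int:
    h = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(h[:8], "big") % NUM_SHARDS

range_recent = Counter(range_shard(k) for k in recent)
hash_recent = Counter(hash_shard(k) for k in recent)
print("range:", dict(range_recent))  # all 1000 recent writes on one shard
print("hash: ", dict(hash_recent))   # spread across all shards
```

This is why the balance and hotspot checks in the Verification section measure the hottest shard against the median under live write traffic, not against historical totals, which can look balanced even when every new write lands on one shard.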
Verification
- Balance SLI: under representative load, no shard exceeds 125% of the cluster median for data size, write QPS, or storage utilization for more than the agreed observation window.
- Routing efficiency: at least 95% of OLTP reads and writes are targeted to a single shard; scatter-gather operations stay below the agreed threshold (for example <= 5% of requests).
- Hotspot test: during peak load, the hottest shard stays within the declared headroom budget (for example p95 CPU, IOPS, or QPS <= 1.5x the shard median).
- Resharding drill: perform an online split or merge on representative production-like data; row counts and checksums match 100%, no acknowledged write is lost, and p95 latency degradation during cutover stays within the agreed limit (for example <= 25%).
- Failure isolation: take one shard's replica set offline in staging and verify that unaffected shards remain within normal latency and error budgets while the impact on the failed shard matches the documented SLO.
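The count-and-checksum comparison used in the resharding drill can be sketched as follows. This is a minimal, assumed model where rows are (key, payload) pairs already scanned from source and target; the XOR-combined digest is order-independent, so the two sides can be compared without sorting (at the cost of exact-duplicate rows cancelling out, which is acceptable for an illustration but worth noting for real use).

```python
import hashlib

def shard_digest(rows):
    """Return (row_count, order-independent checksum) for a shard scan.

    Per-row SHA-256 hashes are XOR-combined, so the result does not
    depend on scan order, only on the multiset of rows (modulo the
    duplicate-cancellation caveat noted above).
    """
    acc = 0
    for key, payload in rows:
        h = hashlib.sha256(f"{key}|{payload}".encode("utf-8")).digest()
        acc ^= int.from_bytes(h, "big")
    return len(rows), acc

source = [("k1", "a"), ("k2", "b"), ("k3", "c")]
target = [("k3", "c"), ("k1", "a"), ("k2", "b")]  # same rows, different scan order
drifted = [("k1", "a"), ("k2", "b"), ("k3", "X")]  # one payload diverged

print(shard_digest(source) == shard_digest(target))   # True: cutover safe
print(shard_digest(source) == shard_digest(drifted))  # False: drift detected
```

In a real drill the digests would be computed incrementally during the sync phase and re-checked after the final catch-up, so the cutover decision rests on a verified match rather than row counts alone.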