1. Sharding Concepts
Sharding is a core technique in distributed database systems that determines how data is distributed across multiple storage devices. Imagine the entire database as a pane of glass: when it shatters into pieces, each piece becomes a database shard. Splitting a complete database into multiple shards is called sharding, a typical scale-out solution.
Core Concept Distinctions:
- Sharding: Logical concept, referring to data partitioning strategies and methods, describing algorithms and rules for distributing data across nodes
- Database/Table Sharding: Physical implementation result, concrete manifestation of sharding strategy
Database Scaling Solutions Comparison:
- Scale-out: Expands system by adding more machines, splitting a single database into multiple instances, theoretically unlimited scalability
- Scale-up: Enhances system capability by improving single-machine performance, upgrading CPU, increasing memory, using faster storage
2. Range-Based Sharding
Range-based sharding distributes data across different storage nodes based on value ranges of specific fields. Common sharding keys include: user IDs (e.g., 0-1 million on node 1, 1-2 million on node 2), order timestamps (grouped by month), product prices, etc.
Advantages:
- Expansion-friendly: New data naturally falls on nodes for the new range; cluster expansion only needs to add nodes for new ranges without redistributing old data
- Query efficiency: Excellent range query performance, supports efficient sequential scans on sharding keys
Disadvantages:
- Hotspot issues: Recent data concentrated on latest time range nodes, causing uneven access
- Uneven data distribution: Some ranges may contain excessive data
- Maintenance complexity: Requires continuous monitoring and range boundary adjustments
Applicable Scenarios: Data with obvious time characteristics (logs, orders), keys with uniform numeric distribution, businesses requiring frequent range queries
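A minimal sketch of a range-based router in Python, following the user-ID example above. The boundaries and node names here are illustrative, not from any real system:

```python
import bisect

# Hypothetical range table: upper bounds (exclusive) for each node's range.
# Expanding the cluster only means appending a new boundary and node name;
# existing data stays where it is.
BOUNDARIES = [1_000_000, 2_000_000, 3_000_000]   # user-ID range upper bounds
NODES = ["node1", "node2", "node3", "node4"]     # last node takes the tail range

def route(user_id: int) -> str:
    """Return the node whose range contains user_id (binary search on boundaries)."""
    return NODES[bisect.bisect_right(BOUNDARIES, user_id)]
```

Because the boundaries are sorted, a range query such as "all users between 1.5M and 2.5M" touches only the nodes whose ranges overlap that interval, which is why sequential scans on the sharding key are cheap.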
3. Hash-Based Sharding
Hash modulo is a basic and commonly used data sharding method.
Implementation Principles:
- Integer keys: Node ID = Key % N
- Other key types: Node ID = Hash(Key) % N (N is the total number of nodes in the current cluster)
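The two rules above can be sketched in a few lines of Python. The choice of MD5 here is an assumption; any hash function with a reasonably uniform output works:

```python
import hashlib

N = 4  # total nodes in the cluster (illustrative)

def node_id(key) -> int:
    """Integer keys are taken modulo N directly; other keys are hashed first."""
    if isinstance(key, int):
        return key % N
    # Hash arbitrary keys to a large integer, then take the modulo.
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % N
```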
Advantages:
- Simple implementation, clear logic, low development cost
- Good data distribution uniformity
- Less prone to hotspot issues
- High query efficiency
Disadvantages:
- Difficult expansion: When adding nodes (e.g., from N to N+1), modulo values change for most data
- Large data migration: Approximately (N/(N+1)) of data needs redistribution during expansion
- Cannot achieve smooth expansion
Mathematical Analysis: Assuming the original cluster has N nodes and expands to N+1, the expected data migration ratio is N/(N+1), since a key stays put only when Key % N == Key % (N+1). For example, expanding from 10 to 11 nodes migrates approximately 10/11 ≈ 90.9% of the data.
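The N/(N+1) ratio is easy to verify empirically. A small simulation over a synthetic key space (the key count is arbitrary, chosen only for the demonstration):

```python
def migration_ratio(num_keys: int, n: int) -> float:
    """Fraction of keys whose node changes when scaling from n to n+1 nodes."""
    moved = sum(1 for k in range(num_keys) if k % n != k % (n + 1))
    return moved / num_keys

# For n = 10 this comes out to roughly 10/11, i.e. about 90.9% of keys move.
```

This is exactly the problem consistent hashing (next section) is designed to avoid.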
4. Consistent Hashing
Consistent hashing maps both data items and nodes onto a virtual hash ring. Each data item is stored on the first node encountered when moving clockwise from the item's position on the ring.
How It Works:
- Hash ring construction: Organize entire hash value space into a virtual ring (0~2^32-1)
- Node mapping: Calculate hash for each node (e.g., hash(IP:port)), map to ring
- Data location: Calculate hash for data key, find first node going clockwise on ring as storage location
Key Advantages:
- Minimal data migration: When adding nodes, only affects data between adjacent nodes
- Virtual node technology: Solves data skew by assigning multiple virtual nodes (e.g., 200-300) to each physical node
- Dynamic balancing: When a node goes online or offline, only about K/N keys move on average (K = total keys, N = total nodes)
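The ring construction, virtual nodes, and clockwise lookup described above can be sketched as follows. The vnode count and MD5-based ring function are illustrative assumptions, not a reference implementation:

```python
import bisect
import hashlib

def _ring_hash(s: str) -> int:
    """Map a string to a point on the 0..2^32-1 ring (MD5, truncated)."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2 ** 32)

class ConsistentHashRing:
    """Minimal sketch: each physical node contributes `vnodes` virtual points."""

    def __init__(self, nodes, vnodes=200):
        # Place vnodes points per physical node, sorted by ring position.
        self._ring = sorted(
            (_ring_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [p for p, _ in self._ring]

    def locate(self, key: str) -> str:
        """First node clockwise from the key's hash; wraps around at the end."""
        idx = bisect.bisect_right(self._points, _ring_hash(key)) % len(self._points)
        return self._ring[idx][1]
```

Adding a fourth node to a three-node ring relocates only the keys that now fall between the new node's virtual points and their predecessors, roughly a quarter of the total, instead of the ~75% that hash-modulo would move.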
Typical Application Scenarios:
- Redis Cluster: Uses a fixed hash-slot scheme (CRC16(key) mod 16384 slots) rather than a classic hash ring, but it addresses the same rebalancing problem
- Memcached: Clients commonly use consistent hashing for sharding
- Distributed storage systems: Like Ceph’s CRUSH algorithm
- Load balancing: Nginx’s consistent hash upstream load strategy