1. Sharding Concepts
Sharding is a core technique in distributed database systems that determines how data is distributed across multiple storage devices. Imagine the entire database as a pane of glass: when it shatters into pieces, each piece becomes a database shard. Splitting a complete database into multiple shards is called sharding, a typical scale-out solution.
Core Concept Distinctions:
- Sharding: Logical concept, referring to data partitioning strategies and methods, describing algorithms and rules for distributing data across nodes
- Database/Table Sharding: Physical implementation result, concrete manifestation of sharding strategy
Database Scaling Solutions Comparison:
- Scale-out: Expands system by adding more machines, splitting a single database into multiple instances, theoretically unlimited scalability
- Scale-up: Enhances system capability by improving single-machine performance, upgrading CPU, increasing memory, using faster storage
2. Range-Based Sharding
Range-based sharding distributes data across different storage nodes based on value ranges of specific fields. Common sharding keys include: user IDs (e.g., 0-1 million on node 1, 1-2 million on node 2), order timestamps (grouped by month), product prices, etc.
Advantages:
- Expansion-friendly: New data naturally falls on nodes for the new range; cluster expansion only needs to add nodes for new ranges without redistributing old data
- Query efficiency: Excellent range query performance, supports efficient sequential scans on sharding keys
Disadvantages:
- Hotspot issues: Recent data concentrated on latest time range nodes, causing uneven access
- Uneven data distribution: Some ranges may contain excessive data
- Maintenance complexity: Requires continuous monitoring and range boundary adjustments
Applicable Scenarios: Data with obvious time characteristics (logs, orders), keys with uniform numeric distribution, businesses requiring frequent range queries
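A minimal sketch of a range-based router in Python, following the user-ID example above. The boundaries and node names here are illustrative, not from any real system:

```python
import bisect

# Hypothetical range table: upper bounds (exclusive) for each node's range.
# Expanding the cluster only means appending a new boundary and node name;
# existing data stays where it is.
BOUNDARIES = [1_000_000, 2_000_000, 3_000_000]   # user-ID range upper bounds
NODES = ["node1", "node2", "node3", "node4"]     # last node takes the tail range

def route(user_id: int) -> str:
    """Return the node whose range contains user_id (binary search on boundaries)."""
    return NODES[bisect.bisect_right(BOUNDARIES, user_id)]
```

Because the boundaries are sorted, a range query such as "all users between 1.5M and 2.5M" touches only the nodes whose ranges overlap that interval, which is why sequential scans on the sharding key are cheap.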
3. Hash-Based Sharding
Hash modulo is a basic and commonly used data sharding method.
Implementation Principles:
- Integer keys: Node ID = Key % N
- Other key types: Node ID = Hash(Key) % N (N is the total number of nodes in the current cluster)
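The two rules above can be sketched in a few lines of Python. The choice of MD5 here is an assumption; any hash function with a reasonably uniform output works:

```python
import hashlib

N = 4  # total nodes in the cluster (illustrative)

def node_id(key) -> int:
    """Integer keys are taken modulo N directly; other keys are hashed first."""
    if isinstance(key, int):
        return key % N
    # Hash arbitrary keys to a large integer, then take the modulo.
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % N
```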
Advantages:
- Simple implementation, clear logic, low development cost
- Good data distribution uniformity
- Less prone to hotspot issues
- High query efficiency
Disadvantages:
- Difficult expansion: When adding nodes (e.g., from N to N+1), modulo values change for most data
- Large data migration: Approximately (N/(N+1)) of data needs redistribution during expansion
- Cannot achieve smooth expansion
Mathematical Analysis: Assuming the original cluster has N nodes and expands to N+1, the expected data migration ratio is N/(N+1), since a key stays put only when Key % N == Key % (N+1). For example, expanding from 10 to 11 nodes migrates approximately 10/11 ≈ 90.9% of the data.
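The N/(N+1) ratio is easy to verify empirically. A small simulation over a synthetic key space (the key count is arbitrary, chosen only for the demonstration):

```python
def migration_ratio(num_keys: int, n: int) -> float:
    """Fraction of keys whose node changes when scaling from n to n+1 nodes."""
    moved = sum(1 for k in range(num_keys) if k % n != k % (n + 1))
    return moved / num_keys

# For n = 10 this comes out to roughly 10/11, i.e. about 90.9% of keys move.
```

This is exactly the problem consistent hashing (next section) is designed to avoid.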
4. Consistent Hashing
Consistent hashing maps both data items and nodes onto a virtual hash ring. Each data item is stored on the first node encountered when moving clockwise from the item's position on the ring.
How It Works:
- Hash ring construction: Organize entire hash value space into a virtual ring (0~2^32-1)
- Node mapping: Calculate hash for each node (e.g., hash(IP:port)), map to ring
- Data location: Calculate hash for data key, find first node going clockwise on ring as storage location
Key Advantages:
- Minimal data migration: When adding nodes, only affects data between adjacent nodes
- Virtual node technology: Solves data skew by assigning multiple virtual nodes (e.g., 200-300) to each physical node
- Dynamic balancing: When a node goes online or offline, only about K/N keys move on average (K = total keys, N = total nodes)
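The ring construction, virtual nodes, and clockwise lookup described above can be sketched as follows. The vnode count and MD5-based ring function are illustrative assumptions, not a reference implementation:

```python
import bisect
import hashlib

def _ring_hash(s: str) -> int:
    """Map a string to a point on the 0..2^32-1 ring (MD5, truncated)."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2 ** 32)

class ConsistentHashRing:
    """Minimal sketch: each physical node contributes `vnodes` virtual points."""

    def __init__(self, nodes, vnodes=200):
        # Place vnodes points per physical node, sorted by ring position.
        self._ring = sorted(
            (_ring_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [p for p, _ in self._ring]

    def locate(self, key: str) -> str:
        """First node clockwise from the key's hash; wraps around at the end."""
        idx = bisect.bisect_right(self._points, _ring_hash(key)) % len(self._points)
        return self._ring[idx][1]
```

Adding a fourth node to a three-node ring relocates only the keys that now fall between the new node's virtual points and their predecessors, roughly a quarter of the total, instead of the ~75% that hash-modulo would move.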
Typical Application Scenarios:
- Redis Cluster: Uses a fixed hash-slot scheme (CRC16(key) mod 16384 slots) rather than a classic hash ring, but it addresses the same rebalancing problem
- Memcached: Clients commonly use consistent hashing for sharding
- Distributed storage systems: Like Ceph’s CRUSH algorithm
- Load balancing: Nginx’s consistent hash upstream load strategy