MySQL Distributed Primary Key Strategies: UUID, SnowFlake...

Primary Key Strategies

In many small projects, we use the database’s auto-increment feature to generate primary key IDs, which is indeed simple. In a sharded database environment, however, we can no longer rely on it directly: each shard keeps its own counter, so the same primary key value would be issued in different shards.

UUID (Universally Unique Identifier)

Basic Concepts

UUID is a 128-bit (16-byte) numeric identifier used to uniquely identify information in distributed systems. Its standard format consists of 32 hexadecimal digits, separated by hyphens into 5 groups in the form 8-4-4-4-12, with a total length of 36 characters (e.g., 550e8400-e29b-41d4-a716-446655440000).

Generation Mechanism

Depending on the version, UUID generation combines one or more of the following inputs to ensure uniqueness:

Network hardware information (e.g., MAC address)
High-precision timestamp (counted in 100-nanosecond intervals in version 1)
Hardware chip ID
Random number generator
Namespace (in specific versions)

Common versions include:

Version 1: Based on timestamp and MAC address
Version 4: Based on random number generation
Version 5: Based on a SHA-1 hash of a namespace and a name
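As a small illustration of the format and versions described above, the following sketch uses Python's standard uuid module. The name "example.com" is only a placeholder chosen for the example.

```python
import uuid

# Version 1: timestamp plus node identifier (MAC address or a random fallback).
u1 = uuid.uuid1()
# Version 4: random bits.
u4 = uuid.uuid4()
# Version 5: hash of a namespace UUID and a name, so the result is deterministic.
u5 = uuid.uuid5(uuid.NAMESPACE_DNS, "example.com")

for u in (u1, u4, u5):
    text = str(u)
    # 32 hex digits + 4 hyphens = 36 characters; the raw value is 16 bytes.
    print(u.version, text, len(text), len(u.bytes))
```

Running the version 5 call twice with the same namespace and name yields the same UUID, which is what distinguishes it from versions 1 and 4.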

Database Application Characteristics

Advantages

Distributed generation: No central server coordination needed, each node can generate independently
Uniqueness guarantee: Extremely low probability of duplication in theory (a version 4 UUID has 122 random bits, so there are about 5.3 × 10^36 possible values)
No network overhead: Local generation without network requests
Security: Does not expose business information (compared to auto-increment ID)
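To put the "extremely low probability" claim into numbers, here is a rough sketch based on the standard birthday-bound approximation. It assumes version 4 UUIDs with 122 uniformly random bits; the function name is made up for this example.

```python
import math

RANDOM_BITS_V4 = 122  # 128 bits minus 4 version bits and 2 variant bits


def collision_probability(n, bits=RANDOM_BITS_V4):
    """Approximate chance that n random IDs contain at least one duplicate.

    Uses the birthday approximation p ~ 1 - exp(-n(n-1) / (2 * 2^bits)).
    """
    space = 2 ** bits
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(-n * (n - 1) / (2 * space))


# Probability of any duplicate among one billion and one trillion random UUIDs.
print(collision_probability(10**9))
print(collision_probability(10**12))
```

The result grows roughly with the square of the number of IDs, so the estimate is worth rerunning for the data volume a system actually expects.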

Disadvantages

Storage overhead: Occupies 16 bytes when stored as BINARY(16), twice the 8 bytes of a BIGINT, and at least 36 bytes when stored as CHAR(36) text
Index efficiency: Poor in the InnoDB engine, because:
- Unordered values cause frequent B+ tree page splits and reorganization
- Secondary indexes must store the complete primary key value
- Both effects increase memory usage and IO operations
Poor readability: Difficult for humans to intuitively remember and identify

Special Impact on InnoDB Engine

Clustered index issues: InnoDB uses primary key as clustered index; UUID unordered insertion causes:
- Increased page split frequency
- Data storage fragmentation
- Decreased cache hit rate
Secondary index bloat: Each secondary index contains the primary key value; 16-byte UUID causes:
- Increased index file size
- Fewer index pages fit in the in-memory buffer pool
- Need to load more data pages during range queries

Optimization Solutions

For scenarios requiring UUID:

Use ordered UUID variants (e.g., COMB UUID)
Store UUID as BINARY(16) instead of as a 36-character string
Establish auto-increment ID as clustered index, use UUID as business key
Consider short hash versions of UUID (need to evaluate collision probability)
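The BINARY(16) suggestion can be sketched as follows. The helper names uuid_to_bin and bin_to_uuid are chosen for this example; the optional swap is intended to mirror what MySQL 8.0's UUID_TO_BIN(uuid, 1) does, moving the high timestamp bits of a version 1 UUID to the front so that values created later tend to sort later.

```python
import uuid


def uuid_to_bin(u, swap=False):
    """Pack a UUID into 16 bytes, optionally moving the v1 time-high field first."""
    b = u.bytes
    if not swap:
        return b
    # u.bytes layout: time_low (4) | time_mid (2) | time_hi_and_version (2) | rest (8)
    return b[6:8] + b[4:6] + b[0:4] + b[8:]


def bin_to_uuid(b, swap=False):
    """Inverse of uuid_to_bin."""
    if swap:
        # Swapped layout: time_hi (2) | time_mid (2) | time_low (4) | rest (8)
        b = b[4:8] + b[2:4] + b[0:2] + b[8:]
    return uuid.UUID(bytes=b)


example = uuid.UUID("550e8400-e29b-41d4-a716-446655440000")
packed = uuid_to_bin(example, swap=True)
print(len(packed), packed.hex())
print(bin_to_uuid(packed, swap=True))
```

The swap only helps with version 1 UUIDs; for random version 4 values the byte order makes no difference to index locality.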

Recommended Application Scenarios

Suitable scenarios for UUID:

Distributed systems requiring offline generation
Businesses needing to know primary key values in advance
Scenarios requiring hidden data scale
Multi-system data merge scenarios

COMB (UUID Variant) Details

Basic Concepts

COMB (combine) is an improved GUID/UUID design from the database field. By combining a traditional GUID/UUID with system time information, it makes newly generated values roughly sequential, which significantly improves database index and retrieval performance.

Technical Background

Standard databases do not have a native COMB data type; this concept was first proposed and elaborated by Jimmy Nilsson in his technical article “The Cost of GUIDs as Primary Keys”.

Design Principles

COMB design is based on the following technical considerations:

A traditional (version 4) GUID/UUID is an almost entirely random 128-bit identifier
This randomness causes serious index fragmentation in databases
Random distribution during data insertion leads to low index efficiency, affecting overall system performance

Specific Implementation

COMB reconstructs GUID using segmented combination:

Retained portion: Keep the first 10 bytes (80 bits) of the GUID unchanged to preserve uniqueness
Time portion: Use the last 6 bytes (48 bits) to store the timestamp of GUID generation (DateTime)
- Precise to millisecond-level time information
- Timestamp stored in big-endian order
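The layout above can be sketched in a few lines. This is a minimal illustration rather than Nilsson's original implementation: it assumes a Unix-epoch millisecond timestamp, and the function names new_comb and comb_timestamp_ms are invented for the example.

```python
import time
import uuid
from typing import Optional


def new_comb(now_ms: Optional[int] = None) -> uuid.UUID:
    """Build a COMB-style value: 10 bytes of a random GUID + 6-byte timestamp."""
    if now_ms is None:
        now_ms = time.time_ns() // 1_000_000
    retained = uuid.uuid4().bytes[:10]                 # first 10 bytes kept as-is
    # 48-bit millisecond timestamp, big-endian, in the last 6 bytes.
    time_part = (now_ms & 0xFFFFFFFFFFFF).to_bytes(6, "big")
    return uuid.UUID(bytes=retained + time_part)


def comb_timestamp_ms(comb: uuid.UUID) -> int:
    """Read the millisecond timestamp back out of the last 6 bytes."""
    return int.from_bytes(comb.bytes[10:], "big")


c = new_comb()
print(c, comb_timestamp_ms(c))
```

One design point to check before adopting this layout: SQL Server compares the last 6 bytes of a uniqueidentifier first, which is why the timestamp sits at the end, whereas MySQL compares BINARY(16) values byte by byte from the left. A MySQL-oriented variant would therefore probably place the timestamp bytes first.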

Performance Advantages

This combination brings significant advantages:

Uniqueness preserved: The first 10 bytes (80 bits) keep collisions extremely unlikely, and only values generated within the same millisecond can collide at all
Added ordering: Timestamp makes newly generated IDs show an increasing trend
Index optimization:
- Reduced index fragmentation
- Improved range query efficiency
- Optimized data page fill rate

Typical Application Scenarios

Distributed database primary key design
High-concurrency order systems
Log systems requiring frequent insertion
Large e-commerce platform product ID generation

SnowFlake Distributed ID Generation Algorithm

In distributed systems, we often need an ID generation solution that is globally unique and roughly time-ordered. SnowFlake is a distributed ID generation algorithm open-sourced by Twitter to solve exactly this problem. The IDs it generates are 64-bit integers.

Data Structure Analysis

SnowFlake’s 64-bit ID consists of:

Sign bit (1bit): Always 0, ensuring generated IDs are positive
Timestamp section (41bit):
- Records timestamp when ID was generated (millisecond level)
- 41 bits can represent a time span of approximately 69 years (2^41/1000/60/60/24/365 ≈ 69.7)
- Usually measured from a custom epoch such as the system launch time, e.g., 2020-01-01 00:00:00
Worker machine ID (10bit):
- High 5 bits represent datacenter ID (supports up to 32 datacenters)
- Low 5 bits represent machine ID (supports up to 32 machines per datacenter)
- This design can support up to 1024 machines (32*32)
Sequence number (12bit):
- Sequence number for different IDs generated within the same millisecond
- 12 bits support 4096 IDs per node per millisecond (2^12)
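The bit layout above translates directly into shifts and masks. The sketch below assumes exactly the 41/5/5/12 split described here; the helper names compose and decompose are made up for the example.

```python
TIMESTAMP_BITS, DATACENTER_BITS, MACHINE_BITS, SEQUENCE_BITS = 41, 5, 5, 12

MACHINE_SHIFT = SEQUENCE_BITS                          # 12
DATACENTER_SHIFT = MACHINE_SHIFT + MACHINE_BITS        # 17
TIMESTAMP_SHIFT = DATACENTER_SHIFT + DATACENTER_BITS   # 22


def compose(ts_ms, datacenter, machine, sequence):
    """Pack the four fields into one 64-bit integer (sign bit stays 0)."""
    return (
        ((ts_ms & ((1 << TIMESTAMP_BITS) - 1)) << TIMESTAMP_SHIFT)
        | ((datacenter & ((1 << DATACENTER_BITS) - 1)) << DATACENTER_SHIFT)
        | ((machine & ((1 << MACHINE_BITS) - 1)) << MACHINE_SHIFT)
        | (sequence & ((1 << SEQUENCE_BITS) - 1))
    )


def decompose(snowflake_id):
    """Split an ID back into (ts_ms, datacenter, machine, sequence)."""
    return (
        snowflake_id >> TIMESTAMP_SHIFT,
        (snowflake_id >> DATACENTER_SHIFT) & ((1 << DATACENTER_BITS) - 1),
        (snowflake_id >> MACHINE_SHIFT) & ((1 << MACHINE_BITS) - 1),
        snowflake_id & ((1 << SEQUENCE_BITS) - 1),
    )


# Lifetime of the 41-bit millisecond timestamp, in years.
years = (1 << TIMESTAMP_BITS) / 1000 / 60 / 60 / 24 / 365
print(round(years, 1))
print(decompose(compose(123456789, 3, 7, 42)))
```

Because the timestamp occupies the highest bits, comparing two IDs as plain integers compares their timestamps first, which is where the time ordering comes from.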

Workflow

When receiving an ID generation request, first get current timestamp
If current timestamp is less than the timestamp of the last generated ID, system clock rollback is detected and an exception is thrown
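If the request falls within the same millisecond as the previous ID, increment the sequence number; if it falls in a later millisecond, reset the sequence number to 0
If the sequence number overflows (more than 4096 IDs in one millisecond), wait until the next millisecond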
Finally, combine each part’s value through bitwise operations to form the final 64-bit ID
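The workflow above can be sketched as a small thread-safe generator. This is an illustrative version, not Twitter's reference implementation: the class name, the injectable clock parameter, and the choice of 2020-01-01 00:00:00 UTC as the epoch are assumptions made for the example.

```python
import threading
import time


class SnowflakeGenerator:
    EPOCH_MS = 1577836800000  # 2020-01-01 00:00:00 UTC (assumed custom epoch)

    def __init__(self, datacenter_id, machine_id, clock=None):
        if not (0 <= datacenter_id < 32 and 0 <= machine_id < 32):
            raise ValueError("datacenter_id and machine_id must be in 0..31")
        self._worker_bits = (datacenter_id << 17) | (machine_id << 12)
        # clock returns the current time in milliseconds; injectable for testing.
        self._clock = clock or (lambda: time.time_ns() // 1_000_000)
        self._last_ms = -1
        self._sequence = 0
        self._lock = threading.Lock()

    def next_id(self):
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                # Clock rollback: refuse to issue IDs rather than risk duplicates.
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & 0xFFF
                if self._sequence == 0:
                    # 4096 IDs already issued this millisecond: wait for the next.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._sequence = 0
            self._last_ms = now
            return ((now - self.EPOCH_MS) << 22) | self._worker_bits | self._sequence


generator = SnowflakeGenerator(datacenter_id=1, machine_id=2)
print(generator.next_id())
```

Raising on rollback is the simplest policy; a production generator might instead wait out small rollbacks, which is a design decision this sketch leaves open.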

Application Scenarios

Distributed systems: As globally unique transaction IDs
Database primary keys: Replace auto-increment IDs to avoid ID conflicts during sharding
Message queues: As unique message identifiers
Log tracking: As request chain tracking IDs

Advantages and Limitations

Advantages:

IDs trend upward over time, which benefits database indexes
No dependency on third-party services; IDs are generated locally
High performance: the 12-bit sequence allows a theoretical maximum of about 4.1 million IDs per second per machine (4096 × 1000)

Limitations:

Depends on the system clock: a clock rollback can produce duplicate IDs unless the generator detects it and refuses to issue IDs
Worker machine IDs need pre-configuration, not conducive to dynamic expansion

Database ID Table (Distributed ID Generation)

Core Principle

Maintain an independent database table dedicated to generating globally unique IDs, using MySQL’s auto-increment ID feature to implement ID generation in a distributed environment.

Implementation Details

Independent ID database construction
- Create a separate MySQL database instance (e.g., named id_generator_db)
- Create a table dedicated to ID generation in this database (e.g., global_id_table)
ID generation process
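- When the business system needs an ID, insert a row into global_id_table and read back the generated auto-increment value (in MySQL, for example, with LAST_INSERT_ID())
- After obtaining the ID, use it when inserting into the business tables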
Sharded table scenario application
- Taking Table A sharding as an example:
  - First get global ID from global_id_table
  - Based on sharding rules (e.g., ID modulo) decide whether to insert into A1 or A2 table

Optimization and Precautions

Performance optimization
- Use connection pool to manage ID database connections
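- Batch ID acquisition: Allocate ID segments in batches, for example by setting the auto_increment_increment parameter to the segment size so that each insert reserves a whole block of IDs for the caller to hand out locally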
High availability solution
- Deploy master-slave architecture to avoid single point of failure
- Consider multi-datacenter deployment of ID generation service
Usage limitations
- Single database throughput is limited (roughly 10k-20k QPS, depending on hardware and configuration)
- Cross-datacenter calls may cause network latency
- Need to pay attention to auto-increment ID overflow issues (use bigint type)

Redis ID Generation

Background and Requirements

In distributed systems, generating globally unique IDs is a common requirement. Traditional database auto-increment IDs may face performance bottlenecks under high concurrency.

Advantages of Redis Solution

Redis, as an in-memory database, has the following characteristics that make it suitable for ID generation:

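Single-threaded command execution makes operations such as INCR atomic
High performance (roughly 100k+ QPS, depending on hardware)
Supports persistence (RDB snapshots and AOF), although a crash can lose the most recent increments unless every write is synced to disk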

Implementation Methods

1. Basic INCR Command

INCR id_counter

Each execution automatically increments the key value by 1
Returns the new integer value as ID

2. Batch ID Generation (INCRBY)

INCRBY id_counter 1000

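Reserve a range of IDs in one round trip: the returned value is the upper bound, so the caller owns the 1000 IDs ending at that value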
Suitable for batch operation scenarios
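A sketch of how a caller might turn one INCRBY reply into a local block of IDs follows. So that it runs without a Redis server, it uses a tiny in-memory stand-in class (FakeRedis, invented for this example) that exposes an incrby method with the same shape as the one in the redis-py client; reserve_ids is likewise a hypothetical helper.

```python
class FakeRedis:
    """Minimal in-memory stand-in for the one Redis call used below."""

    def __init__(self):
        self._data = {}

    def incrby(self, name, amount=1):
        self._data[name] = self._data.get(name, 0) + amount
        return self._data[name]


def reserve_ids(client, key, count):
    """Reserve `count` consecutive IDs with a single INCRBY round trip."""
    upper = client.incrby(key, count)  # counter value after the increment
    return range(upper - count + 1, upper + 1)


client = FakeRedis()
block = reserve_ids(client, "id_counter", 1000)
print(block[0], block[-1])
```

IDs in a reserved block that are never used are simply lost when the process restarts, so the sequence stays unique but may contain gaps.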

3. Timestamp Combination Mode

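INCR daily_counter:20230515

Generate format: date (8 digits) + auto-increment sequence (6 digits)
For example: 20230515-000001
The counter restarts each day only because the key name includes the date; a fixed key such as daily_counter never resets by itself
Give each daily key an expiry (EXPIRE) so that old counters are cleaned up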
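The date-plus-sequence format can be sketched as follows, assuming the key name embeds the date so that each day starts a fresh counter. FakeRedis is an in-memory stand-in invented for this example so the code runs without a server; its incr and expire methods follow the shape of the corresponding redis-py calls, and the two-day expiry is an arbitrary choice.

```python
import datetime


class FakeRedis:
    """Minimal in-memory stand-in for the two Redis calls used below."""

    def __init__(self):
        self._data = {}
        self.expirations = {}

    def incr(self, name, amount=1):
        self._data[name] = self._data.get(name, 0) + amount
        return self._data[name]

    def expire(self, name, seconds):
        # A real server would delete the key later; here we only record the TTL.
        self.expirations[name] = seconds
        return True


def next_order_id(client, today=None):
    """Return an ID such as 20230515-000001 from a per-day counter key."""
    today = today or datetime.date.today()
    day = today.strftime("%Y%m%d")
    key = f"daily_counter:{day}"
    sequence = client.incr(key)
    if sequence == 1:
        # First ID of the day: let the key expire once the day is safely over.
        client.expire(key, 2 * 24 * 3600)
    return f"{day}-{sequence:06d}"


client = FakeRedis()
print(next_order_id(client, datetime.date(2023, 5, 15)))
```

With six digits, the sequence part grows beyond its fixed width after 999,999 IDs in one day, so the width should be sized to the expected daily volume.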

Application Scenarios

Order system: Generate unique order numbers
Log system: Mark each log with unique ID
Distributed locks: Implement lock mechanism based on ID
Message queues: Unique message identifier

Performance Comparison

Solution	Approx. QPS	Pros	Cons
Database auto-increment ID	1k-5k	Simple and reliable	Limited performance
Redis INCR	50k+	High performance, atomic	Requires Redis service maintenance
UUID	100k+	No central node needed	Long ID, unordered