Primary Key Strategies

In small projects, the database's auto-increment feature is usually the simplest way to generate primary key IDs. In a sharded database environment, however, each shard keeps its own auto-increment counter, so relying on it directly would produce duplicate primary keys across the shards of the same logical table.

UUID (Universally Unique Identifier)

Basic Concepts

UUID is a 128-bit (16-byte) numeric identifier used to uniquely identify information in distributed systems. Its standard format consists of 32 hexadecimal digits, separated by hyphens into 5 groups in the form 8-4-4-4-12, with a total length of 36 characters (e.g., 550e8400-e29b-41d4-a716-446655440000).
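
As a quick check of the format described above, the following minimal sketch parses the sample value with Python's standard `uuid` module and prints the sizes mentioned in the text. Nothing here is specific to any database.

```python
import uuid

# The sample value from the text, parsed with the standard library.
u = uuid.UUID("550e8400-e29b-41d4-a716-446655440000")

text = str(u)                                   # canonical 8-4-4-4-12 form
group_lengths = [len(g) for g in text.split("-")]

print(len(text))        # 36 characters, including the 4 hyphens
print(group_lengths)    # [8, 4, 4, 4, 12]
print(len(u.hex))       # 32 hexadecimal digits
print(len(u.bytes))     # 16 bytes = 128 bits
```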

Generation Mechanism

Depending on the version, UUID generation draws on one or more of the following inputs to ensure uniqueness:

  1. Network hardware information (e.g., MAC address)
  2. High-precision timestamp (counted in 100-nanosecond intervals in version 1)
  3. Hardware chip ID
  4. Random number generator
  5. Namespace (in specific versions)

Common versions include:

  • Version 1: Based on a timestamp and the MAC address
  • Version 4: Based on random number generation
  • Version 5: Based on a namespace and a name, hashed with SHA-1
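
The three versions can be compared with Python's standard `uuid` module. This is only an illustrative sketch; the name "example.com" is an arbitrary input chosen for the demonstration.

```python
import uuid

v1 = uuid.uuid1()   # timestamp + node ID (MAC address, or a random node ID if none is available)
v4 = uuid.uuid4()   # random bits
v5 = uuid.uuid5(uuid.NAMESPACE_DNS, "example.com")  # SHA-1 over namespace + name

print(v1.version, v4.version, v5.version)

# Version 5 is deterministic: the same namespace and name always give the same UUID.
v5_again = uuid.uuid5(uuid.NAMESPACE_DNS, "example.com")
print(v5 == v5_again)
```

Because version 5 is deterministic, it suits cases where the same input must always map to the same ID, while versions 1 and 4 produce a new value on every call.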

Database Application Characteristics

Advantages

  1. Distributed generation: No central server coordination needed, each node can generate independently
  2. Uniqueness guarantee: Extremely low probability of duplication in theory (the 128-bit space holds 2^128, about 3.4 × 10^38, values; version 4 uses 122 of those bits for randomness)
  3. No network overhead: Local generation without network requests
  4. Security: Does not expose business information (compared to auto-increment ID)
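
To put the "extremely low probability" into numbers, here is a rough estimate based on the birthday-bound approximation. It assumes version 4 UUIDs with 122 uniformly random bits; the function name and the sample volume of 10^12 IDs are illustrative choices, not figures from the text.

```python
import math

RANDOM_BITS = 122  # version 4: 128 bits minus 4 version bits and 2 variant bits

def collision_probability(n: int, bits: int = RANDOM_BITS) -> float:
    """Birthday-bound approximation: p ~= 1 - exp(-n(n-1) / 2^(bits+1))."""
    return -math.expm1(-n * (n - 1) / 2 ** (bits + 1))

print(f"{2 ** 128:.3e}")                        # size of the full 128-bit space
print(f"{collision_probability(10 ** 12):.3e}")  # chance of any collision among 10^12 random UUIDs
```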

Disadvantages

  1. Storage overhead: Occupies 16 bytes in binary form (36 bytes when stored as a CHAR(36) string), at least twice the 8 bytes of BIGINT
  2. Index efficiency: Lower in the InnoDB engine because:
    • Unordered values cause frequent B+ tree page splits and reorganization
    • Secondary indexes must store the complete primary key value
    • Both effects increase memory usage and IO operations
  3. Poor readability: Difficult for humans to intuitively remember and identify

Special Impact on InnoDB Engine

  1. Clustered index issues: InnoDB uses the primary key as the clustered index, so inserting UUIDs in random order causes:
    • Increased page split frequency
    • Data storage fragmentation
    • Decreased cache hit rate
  2. Secondary index bloat: Each secondary index entry contains the primary key value, so a 16-byte UUID causes:
    • Increased index file size
    • Fewer index pages fitting in the in-memory buffer pool
    • More data pages to load during range queries

Optimization Solutions

For scenarios requiring UUID:

  1. Use ordered UUID variants (e.g., COMB UUID)
  2. Store UUID as binary(16)
  3. Establish auto-increment ID as clustered index, use UUID as business key
  4. Consider short hash versions of UUID (need to evaluate collision probability)
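
The storage difference behind option 2 can be seen with a small sketch: the same UUID takes 36 bytes as text and 16 bytes as raw binary, and the binary form converts back without loss. The variable names are illustrative only.

```python
import uuid

u = uuid.uuid4()

as_text = str(u)       # what a CHAR(36) column would hold
as_binary = u.bytes    # what a BINARY(16) column would hold

print(len(as_text), len(as_binary))

# The binary form round-trips back to the same UUID.
restored = uuid.UUID(bytes=as_binary)
print(restored == u)
```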

Suitable scenarios for UUID:

  • Distributed systems requiring offline generation
  • Business flows that need to know the primary key value before the row is inserted
  • Scenarios where the data volume must not be inferable from the IDs
  • Multi-system data merge scenarios

COMB (UUID Variant) Details

Basic Concepts

COMB (from "combine") is a GUID/UUID variant designed for use in databases. It combines a traditional GUID/UUID with system time information so that newly generated values are roughly ordered, which significantly improves index and retrieval performance.

Technical Background

Standard databases do not have a native COMB data type; this concept was first proposed and elaborated by Jimmy Nilsson in his technical article “The Cost of GUIDs as Primary Keys”.

Design Principles

COMB design is based on the following technical considerations:

  1. A traditional GUID/UUID is an essentially random 128-bit identifier with no relation to insertion order
  2. This randomness causes serious index fragmentation in databases
  3. Random distribution during data insertion leads to low index efficiency, affecting overall system performance

Specific Implementation

COMB reconstructs GUID using segmented combination:

  1. Retained portion: Keep the first 10 bytes (80 bits) of GUID unchanged to ensure uniqueness
  2. Time portion: Use last 6 bytes (48 bits) to store timestamp of GUID generation (DateTime)
    • Precise to millisecond-level time information
    • Timestamp stored in big-endian order
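
The layout above can be sketched as follows. This is a minimal illustration, not Nilsson's original implementation: it assumes a version 4 UUID as the source of the first 10 bytes and a Unix-epoch millisecond timestamp for the last 6, and the function name `new_comb` is hypothetical.

```python
import time
import uuid
from typing import Optional

def new_comb(now_ms: Optional[int] = None) -> uuid.UUID:
    """Build a COMB-style value: 10 bytes of a random GUID followed by a 6-byte timestamp."""
    if now_ms is None:
        now_ms = time.time_ns() // 1_000_000                 # milliseconds since the Unix epoch (assumption)
    retained = uuid.uuid4().bytes[:10]                       # retained portion: first 10 bytes (80 bits)
    time_part = (now_ms & (2 ** 48 - 1)).to_bytes(6, "big")  # 48-bit timestamp, big-endian
    return uuid.UUID(bytes=retained + time_part)

a = new_comb(1_000)
b = new_comb(2_000)
print(a, b)
print(a.bytes[10:] < b.bytes[10:])   # the trailing time portion grows with time
```

Note that a plain byte-wise comparison looks at the random bytes first, so this layout only yields an increasing order in engines that compare the trailing bytes first, as SQL Server does for its uniqueidentifier type. For a column compared from left to right, the timestamp would have to be placed at the front instead.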

Performance Advantages

This combination brings significant advantages:

  1. Uniqueness preserved: The first 10 bytes (80 bits) keep collisions extremely unlikely, and the timestamp further separates values generated at different times
  2. Added ordering: Timestamp makes newly generated IDs show an increasing trend
  3. Index optimization:
    • Reduced index fragmentation
    • Improved range query efficiency
    • Optimized data page fill rate

Typical Application Scenarios

  1. Distributed database primary key design
  2. High-concurrency order systems
  3. Log systems requiring frequent insertion
  4. Large e-commerce platform product ID generation

SnowFlake Distributed ID Generation Algorithm

Distributed systems often need an ID generation scheme that is both globally unique and time-ordered. SnowFlake is a distributed ID generation algorithm open-sourced by Twitter to solve this problem. The IDs it generates are 64-bit integers.

Data Structure Analysis

SnowFlake’s 64-bit ID consists of:

  1. Sign bit (1 bit): Always 0, ensuring generated IDs are positive
  2. Timestamp section (41 bits):
    • Records the timestamp at which the ID was generated (millisecond level)
    • 41 bits can represent a time span of approximately 69 years (2^41 ms / 1000 / 60 / 60 / 24 / 365)
    • Usually stored as an offset from a custom epoch such as the system launch time (e.g., 2020-01-01 00:00:00) rather than from the Unix epoch, so the 69 years start at launch
  3. Worker machine ID (10 bits):
    • High 5 bits represent the datacenter ID (supports up to 32 datacenters)
    • Low 5 bits represent the machine ID (supports up to 32 machines per datacenter)
    • This design can support up to 1024 machines (32*32)
  4. Sequence number (12 bits):
    • Distinguishes IDs generated within the same millisecond
    • 12 bits support 4096 IDs per node per millisecond (2^12)
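
The capacity figures in this layout follow directly from the bit widths. The short calculation below reproduces them; the constant names are illustrative.

```python
TIMESTAMP_BITS = 41
DATACENTER_BITS = 5
MACHINE_BITS = 5
SEQUENCE_BITS = 12

total_bits = 1 + TIMESTAMP_BITS + DATACENTER_BITS + MACHINE_BITS + SEQUENCE_BITS
years = 2 ** TIMESTAMP_BITS / 1000 / 60 / 60 / 24 / 365   # milliseconds -> years
workers = 2 ** DATACENTER_BITS * 2 ** MACHINE_BITS
ids_per_ms = 2 ** SEQUENCE_BITS

print(total_bits)        # 64
print(round(years, 1))   # 69.7
print(workers)           # 1024
print(ids_per_ms)        # 4096
```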

Workflow

  1. When an ID generation request arrives, first get the current timestamp
  2. If the current timestamp is less than the timestamp of the last generated ID, a system clock rollback is detected and an exception is thrown
  3. If the request falls within the same millisecond as the last generated ID, increment the sequence number; otherwise reset the sequence number to 0
  4. If the sequence number overflows, wait until the next millisecond
  5. Finally, combine the parts through bitwise operations to form the final 64-bit ID
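
The workflow can be sketched as a small generator class. This is a simplified illustration under stated assumptions, not Twitter's implementation: the epoch is the 2020-01-01 example from the text interpreted as UTC, the class name and the injectable `clock` parameter are hypothetical, and waiting for the next millisecond is done with a simple busy loop.

```python
import threading
import time

class SnowflakeGenerator:
    EPOCH_MS = 1577836800000  # 2020-01-01 00:00:00, assumed to be UTC

    def __init__(self, datacenter_id: int, machine_id: int, clock=None):
        if not 0 <= datacenter_id < 32 or not 0 <= machine_id < 32:
            raise ValueError("datacenter_id and machine_id must be in 0..31")
        self.datacenter_id = datacenter_id
        self.machine_id = machine_id
        # clock() returns the current time in milliseconds; injectable for testing
        self._clock = clock or (lambda: time.time_ns() // 1_000_000)
        self._last_ms = -1
        self._sequence = 0
        self._lock = threading.Lock()

    def next_id(self) -> int:
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                # Step 2: clock rollback detected
                raise RuntimeError("clock moved backwards; refusing to generate ID")
            if now == self._last_ms:
                # Step 3: same millisecond, bump the 12-bit sequence
                self._sequence = (self._sequence + 1) & 0xFFF
                if self._sequence == 0:
                    # Step 4: sequence overflowed, wait for the next millisecond
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._sequence = 0
            self._last_ms = now
            # Step 5: timestamp (41 bits) | datacenter (5) | machine (5) | sequence (12)
            return (((now - self.EPOCH_MS) << 22)
                    | (self.datacenter_id << 17)
                    | (self.machine_id << 12)
                    | self._sequence)

gen = SnowflakeGenerator(datacenter_id=1, machine_id=2)
print(gen.next_id())
print(gen.next_id())
```

A production generator would also need a policy for the rollback case (for example waiting out small rollbacks) and a way to assign worker IDs, both of which are left out here.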

Application Scenarios

  1. Distributed systems: As globally unique transaction IDs
  2. Database primary keys: Replace auto-increment IDs to avoid ID conflicts during sharding
  3. Message queues: As unique message identifiers
  4. Log tracking: As request chain tracking IDs

Advantages and Limitations

Advantages:

  • IDs trend upward over time, which benefits database indexes
  • No dependency on third-party services; IDs are generated locally
  • High performance: the 12-bit sequence allows up to 4096 IDs per millisecond, about 4 million IDs per second per machine in theory

Limitations:

  • Depends on the system clock: a clock rollback that is not detected and rejected can cause ID duplication
  • Worker machine IDs need to be configured in advance, which makes dynamic scaling harder

Database ID Table (Distributed ID Generation)

Core Principle

Maintain an independent database table dedicated to generating globally unique IDs, using MySQL’s auto-increment ID feature to implement ID generation in a distributed environment.

Implementation Details

  1. Independent ID database construction

    • Create a separate MySQL database instance (e.g., named id_generator_db)
    • Create a table dedicated to ID generation in this database (e.g., global_id_table)
  2. ID generation process

    • When a business system needs an ID, it inserts a row into global_id_table and reads back the auto-increment value that MySQL assigned
    • The obtained ID can then be used when inserting into business tables
  3. Sharded table scenario application

    • Taking Table A sharding as an example:
      • First get global ID from global_id_table
      • Based on sharding rules (e.g., ID modulo) decide whether to insert into A1 or A2 table
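
The flow above can be sketched end to end. To keep the example self-contained it uses an in-memory SQLite database as a stand-in for the MySQL instance, so the SQL is SQLite syntax and not the statement a MySQL deployment would run (there, the generated value would typically be read back with LAST_INSERT_ID()). The helper names and the two-shard modulo rule are assumptions for illustration.

```python
import sqlite3

# In-memory SQLite stands in for the dedicated ID database (assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE global_id_table (id INTEGER PRIMARY KEY AUTOINCREMENT)")

def next_global_id(conn: sqlite3.Connection) -> int:
    """Insert an empty row and return the auto-increment value assigned to it."""
    cur = conn.execute("INSERT INTO global_id_table DEFAULT VALUES")
    conn.commit()
    return cur.lastrowid

def target_table(global_id: int, shard_count: int = 2) -> str:
    """Pick the shard by ID modulo: A1 or A2 when shard_count is 2."""
    return f"A{global_id % shard_count + 1}"

for _ in range(4):
    new_id = next_global_id(conn)
    print(new_id, "->", target_table(new_id))
```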

Optimization and Precautions

  1. Performance optimization
    • Use connection pool to manage ID database connections
    • Batch ID acquisition: Allocate ID segments in batches by setting the auto_increment_increment parameter to the segment size, so that each insert reserves a whole block of IDs for the caller
  2. High availability solution
    • Deploy master-slave architecture to avoid single point of failure
    • Consider multi-datacenter deployment of ID generation service
  3. Usage limitations
    • Single database throughput is limited (about 10k-20k QPS, depending on hardware and configuration)
    • Cross-datacenter calls may cause network latency
    • Need to pay attention to auto-increment ID overflow issues (use bigint type)

Redis ID Generation

Background and Requirements

In distributed systems, generating globally unique IDs is a common requirement. Traditional database auto-increment IDs may face performance bottlenecks under high concurrency.

Advantages of Redis Solution

Redis, as an in-memory database, has the following characteristics that make it suitable for ID generation:

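  1. Commands execute on a single thread, so operations such as INCR are atomic
  2. High performance (100k+ QPS)
  3. Supports persistence (RDB snapshots and AOF), which reduces the risk of losing the counter after a restart; writes not yet persisted at the time of a crash can still be lost, so the persistence settings need to match the uniqueness requirement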

Implementation Methods

1. Basic INCR Command

INCR id_counter
  • Each execution automatically increments the key value by 1
  • Returns the new integer value as ID

2. Batch ID Generation (INCRBY)

INCRBY id_counter 1000
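  • Reserves a range of IDs in one call: the returned value is the upper end of the range, so the caller owns the 1000 IDs ending at it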
  • Suitable for batch operation scenarios
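
How a caller turns the INCRBY return value into a usable ID range can be sketched as follows. To stay runnable without a Redis server, the sketch uses a small in-memory class that only mimics the return value of INCRBY; the class and function names are hypothetical. With a real client such as redis-py, the `incrby` call would go to the server instead.

```python
class FakeCounterStore:
    """In-memory stand-in that mimics the return value of Redis INCRBY (not a real client)."""

    def __init__(self):
        self._values = {}

    def incrby(self, key: str, amount: int) -> int:
        self._values[key] = self._values.get(key, 0) + amount
        return self._values[key]

def reserve_ids(store, key: str, count: int) -> range:
    """Reserve `count` consecutive IDs and return them as a range."""
    upper = store.incrby(key, count)          # new counter value = last ID of the block
    return range(upper - count + 1, upper + 1)

store = FakeCounterStore()
first_block = reserve_ids(store, "id_counter", 1000)
second_block = reserve_ids(store, "id_counter", 1000)
print(first_block[0], first_block[-1])     # 1 1000
print(second_block[0], second_block[-1])   # 1001 2000
```

The application then hands out IDs from its local block and only goes back to Redis when the block is exhausted; IDs left unused when a process exits are simply skipped.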

3. Timestamp Combination Mode

INCR daily_counter
  • Generate format: date (8 digits) + auto-increment sequence (6 digits)
  • For example: 20230515-000001
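  • INCR itself never resets a counter; the sequence restarts each day only if a separate key is used per date (e.g., daily_counter:20230515, optionally with an expiry) or the key is reset at midnight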
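
The date-plus-sequence format can be sketched as below. The sequence argument is the value returned by INCR for that day; the per-day key naming and both function names are assumptions of this sketch, and no Redis connection is involved.

```python
from datetime import date

def daily_counter_key(day: date) -> str:
    """Per-day key name, so that each date starts its own counter (assumed naming scheme)."""
    return f"daily_counter:{day:%Y%m%d}"

def format_daily_id(day: date, sequence: int) -> str:
    """Combine an 8-digit date with a zero-padded 6-digit sequence."""
    if not 1 <= sequence <= 999_999:
        raise ValueError("sequence does not fit in 6 digits")
    return f"{day:%Y%m%d}-{sequence:06d}"

day = date(2023, 5, 15)
print(daily_counter_key(day))      # daily_counter:20230515
print(format_daily_id(day, 1))     # 20230515-000001
```

The 6-digit sequence caps the scheme at 999,999 IDs per day, which is why the sketch rejects larger values instead of silently producing a longer ID.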

Application Scenarios

  1. Order system: Generate unique order numbers
  2. Log system: Mark each log with unique ID
  3. Distributed locks: Implement lock mechanism based on ID
  4. Message queues: Unique message identifier

Performance Comparison

Solution                     QPS     Pros                       Cons
Database auto-increment ID   1k-5k   Simple and reliable        Limited performance
Redis INCR                   50k+    High performance, atomic   Requires Redis service maintenance
UUID                         100k+   No central node needed     Long ID, unordered