Scaling Triggers

  1. Capacity indicator: Disk usage exceeds 80% and is projected to reach its upper limit within 3 months
  2. Performance indicator: Query response time consistently exceeds the SLA threshold (e.g., >500ms)
  3. Concurrency indicator: Active connections consistently above 70% of the configured maximum
  4. Monitoring alerts: Recurring CPU/IO saturation alerts

Horizontal Scaling Implementation Steps

1. Evaluation and Planning Phase

  • Conduct a capacity assessment and model the current data growth curve
  • Determine the scaling ratio (e.g., increase node count by 50%)
  • Select a scaling strategy: consistent-hash scaling or range-based scaling
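The choice between the two strategies largely comes down to how much data must move. A minimal sketch of consistent-hash placement (virtual-node ring; the `db0`-style names are illustrative and not tied to any particular middleware) shows that adding 50% more nodes remaps only the keys the new nodes take over, whereas plain modulo placement would remap about two-thirds of them:

```python
import bisect
import hashlib

def h(key: str) -> int:
    """Stable 32-bit hash of a key (md5-based, so results are reproducible)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2**32)

class ConsistentHashRing:
    """Minimal consistent-hash ring with virtual nodes."""
    def __init__(self, nodes, vnodes=100):
        # Each physical node contributes `vnodes` points on the ring.
        self.ring = sorted((h(f"{node}#{i}"), node) for node in nodes for i in range(vnodes))
        self.points = [p for p, _ in self.ring]

    def node_for(self, key: str) -> str:
        # A key belongs to the first ring point clockwise from its hash.
        idx = bisect.bisect(self.points, h(key)) % len(self.ring)
        return self.ring[idx][1]

keys = [f"user:{i}" for i in range(10_000)]
before = ConsistentHashRing(["db0", "db1", "db2", "db3"])
after = ConsistentHashRing(["db0", "db1", "db2", "db3", "db4", "db5"])  # +50% nodes
moved = sum(before.node_for(k) != after.node_for(k) for k in keys)
print(f"keys remapped: {moved / len(keys):.0%}")  # roughly 1/3; plain modulo would remap ~2/3
```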

2. Data Migration Plan

Plan A: Online Migration (Recommended)

  1. Deploy new nodes and add to cluster
  2. Configure data synchronization mechanism (e.g., MySQL GTID replication)
  3. Migrate hot data in batches
  4. Switch traffic and verify
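Step 3 (batch migration) can be sketched as a loop that copies rows in primary-key order, one short transaction per batch, so the source stays available throughout. This uses sqlite3 purely as a stand-in for MySQL; the GTID-based catch-up and traffic switch of steps 2 and 4 are not shown:

```python
import sqlite3

def migrate_in_batches(src, dst, table, batch_size=1000):
    """Copy rows in ascending primary-key batches; each chunk is a short
    transaction, so the source keeps serving reads and writes in between."""
    last_id = 0
    while True:
        rows = src.execute(
            f"SELECT id, payload FROM {table} WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            break
        dst.executemany(f"INSERT INTO {table}(id, payload) VALUES (?, ?)", rows)
        dst.commit()
        last_id = rows[-1][0]  # resume after the last copied key

src = sqlite3.connect(":memory:")
dst = sqlite3.connect(":memory:")
for conn in (src, dst):
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, payload TEXT)")
src.executemany("INSERT INTO orders VALUES (?, ?)", [(i, f"o{i}") for i in range(1, 5001)])
src.commit()
migrate_in_batches(src, dst, "orders", batch_size=1000)
print(dst.execute("SELECT COUNT(*) FROM orders").fetchone()[0])  # 5000
```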

Plan B: Downtime Migration

  1. Stop write services
  2. Full backup of existing data
  3. Redistribute data to new and old nodes
  4. Restore services
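The reason step 3 dominates the downtime window is that under plain hash-mod placement, growing the shard count relocates most rows. A quick illustration with a hypothetical 4-to-6 node change:

```python
def shard_of(user_id: int, shard_count: int) -> int:
    """Plain hash-mod placement, used by both the old and new layouts."""
    return user_id % shard_count

# Count how many of 100 sample rows land on a different shard when the
# cluster grows from 4 to 6 nodes.
rows = range(100)
moved = sum(shard_of(r, 4) != shard_of(r, 6) for r in rows)
print(f"{moved} of 100 rows must be relocated")  # 64 of 100
```

Only rows whose id is congruent modulo lcm(4, 6) = 12 to one of 0..3 stay put, which is why migration time scales with total data volume rather than with the amount of new capacity.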

3. Sharding Strategy Adjustment

  • Refactor sharding key algorithm
  • Update routing configuration (e.g., MyCat/ShardingSphere configuration)
  • Test data balance
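A data-balance test can be as simple as routing a sample of keys through the new rule and checking the spread across data sources. The `ds_` naming below mirrors the style of MyCat/ShardingSphere data-source rules but is only illustrative:

```python
from collections import Counter

def route(user_id: int, shard_count: int) -> str:
    """Hash-mod routing in the style of a sharding middleware rule."""
    return f"ds_{user_id % shard_count}"

def balance_report(ids, shard_count):
    """Return per-data-source counts and the max/min imbalance ratio."""
    counts = Counter(route(i, shard_count) for i in ids)
    skew = max(counts.values()) / min(counts.values())
    return counts, skew

counts, skew = balance_report(range(1, 120_001), 12)
print(f"data sources: {len(counts)}, max/min ratio: {skew:.3f}")  # 12 sources, ratio 1.000
```

In practice the sample should come from real key distributions, since synthetic sequential ids can hide the skew patterns discussed below.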

4. Application Layer Adaptation

  • Update data source configuration
  • Adjust connection pool parameters
  • Modify potentially affected SQL statements
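Connection pool sizing is mostly arithmetic: each application instance typically keeps one pool per shard, so every shard node receives roughly `app_instances × pool_size` connections, and the pool must shrink as the instance count grows. A sketch with hypothetical numbers:

```python
def pool_size_per_instance(max_connections: int, app_instances: int, headroom: float = 0.8) -> int:
    """Size each instance's per-shard pool so that the total connections a
    single DB node receives stay under max_connections * headroom."""
    return int(max_connections * headroom / app_instances)

# Hypothetical numbers: 20 app instances against nodes with max_connections=1000.
print(pool_size_per_instance(1000, app_instances=20))  # 40
```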

Common Challenges and Solutions

  1. Data skew issue:

    • Case: An e-commerce platform's user table, sharded by ID hash, left some shards with 3x the data of others
    • Solution: Use a composite sharding key (e.g., ID + registration time)
  2. Cross-shard transactions:

    • Introduce distributed transaction framework (e.g., Seata)
    • Or adopt eventual consistency model
  3. Scaling cost control:

    • Use hybrid deployment strategy (SSD+HDD hybrid storage)
    • Implement hot-cold data separation
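For the data skew case above, the effect of a composite sharding key can be sketched as follows. The scenario (batch-imported accounts whose ids step by 4) and the month values are hypothetical; the point is that hashing id + registration time breaks up id patterns that raw `id % n` preserves:

```python
import hashlib
from collections import Counter

def shard_simple(user_id: int, n: int) -> int:
    return user_id % n

def shard_composite(user_id: int, reg_month: str, n: int) -> int:
    """Hash a composite of id + registration time so regular id patterns
    (e.g. batch-imported accounts with ids in fixed steps) stop clustering."""
    digest = hashlib.md5(f"{user_id}:{reg_month}".encode()).hexdigest()
    return int(digest, 16) % n

# Skewed workload: imported accounts were assigned ids in steps of 4.
ids = [i * 4 for i in range(10_000)]
months = [f"2023-{(i % 12) + 1:02d}" for i in range(10_000)]
simple = Counter(shard_simple(i, 4) for i in ids)
composite = Counter(shard_composite(i, m, 4) for i, m in zip(ids, months))
print(dict(simple))                      # every key lands on shard 0
print(dict(sorted(composite.items())))   # roughly 2500 per shard
```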

Best Practices Case Study

Scaling experience from a social platform when daily active users exceeded 10 million:

  1. Scaled from 8 shards to 12 shards
  2. Used the online migration method; the migration took 72 hours
  3. QPS degradation was kept within 15% during the migration
  4. After scaling, TP99 latency dropped from 800ms to 300ms

Downtime Scaling

Overview

Downtime scaling is a common approach in early database architecture evolution, suitable for scenarios where database size is relatively small and brief service interruptions are acceptable.

Detailed Implementation Steps

  1. Service announcement phase: Publish a maintenance notice 3-5 days before scaling
  2. Service stop phase: Close the load balancer traffic entry and stop all application service processes
  3. Data migration phase: Add new database instances and run migration scripts that apply the new sharding rules
  4. Configuration update phase: Update database connection pool configuration and adjust the sharding routing logic
  5. Service recovery phase: Start database services first, then application services

Pros and Cons

Advantages:

  • Simple and direct to implement, with low technical difficulty
  • No complex data synchronization mechanism needed
  • Architecture adjustment is completed in a single pass

Limitations:

  • Requires downtime, interrupting business continuity
  • Migration time grows significantly with data volume
  • Unsuitable for businesses requiring 24/7 high availability

Applicable Scenarios

  • Early database expansion for startups
  • Internal management system upgrades
  • ToB services that can accept scheduled maintenance
  • Migrations with data volumes below the TB level

Smooth Scaling

Overview

The core idea of smooth scaling is a gradual doubling strategy: increase the number of database nodes in stages while keeping the service uninterrupted.

Plan Characteristics

  1. Scaling ratio: Adopts a 2x doubling strategy (e.g., expanding from 2 DB nodes to 4)
  2. Technical requirements: Relies on dual-master replication for data synchronization
  3. Implementation phases: Divided into two main phases: test verification and production deployment
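The doubling ratio is not arbitrary: under hash-mod routing, a 2x scale-out means each old node hands data to exactly one new node, so one dual-master replication pair per old node covers the whole migration. A quick check:

```python
# With a 2 -> 4 doubling under hash-mod routing, every key on old shard k
# ends up on new shard k or new shard k + 2 -- never anywhere else.
old, new = 2, 4
targets = {k: set() for k in range(old)}
for user_id in range(10_000):
    targets[user_id % old].add(user_id % new)
print(targets)  # {0: {0, 2}, 1: {1, 3}}
```

A non-doubling ratio (say 2 to 3 nodes) would scatter each old node's keys across all new nodes, which is exactly the full-reshuffle cost the previous sections describe.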

Detailed Implementation Steps

  1. Infrastructure preparation: Provision new nodes and ensure they are in the same VPC as the existing cluster
  2. Test environment verification: Build a test cluster and verify data consistency
  3. Production environment deployment: Deploy in a rolling fashion and configure the middleware to migrate traffic gradually
  4. Ongoing maintenance: Deploy monitoring and perform regular maintenance
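The gradual traffic migration in step 3 is typically a weight knob on the routing middleware. A minimal weighted-routing sketch (the cluster names and the 90/10 split are illustrative):

```python
import random

def route_read(old_weight: int, new_weight: int) -> str:
    """Send a read to the 'old' or 'new' cluster in proportion to the weights --
    the knob that is turned up in stages (e.g. 10% -> 50% -> 100% to new)."""
    return random.choices(["old", "new"], weights=[old_weight, new_weight])[0]

random.seed(7)  # seeded only so the sketch is reproducible
sample = [route_read(90, 10) for _ in range(10_000)]
print(f"new-cluster share: {sample.count('new') / len(sample):.1%}")  # ~10%
```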

Advantages

  1. Business continuity: Services remain available throughout scaling
  2. Reduced team pressure: A more relaxed time window allows phased execution
  3. Better risk control: Real-time monitoring lets issues be handled immediately
  4. Performance gains: The reduced per-database data volume brings a significant performance improvement

Disadvantages

  1. High implementation complexity: Requires configuring dual-master synchronization and dual-write logic
  2. High scaling cost: Operational overhead surges, and the number of synchronization links grows quadratically with node count

Applicable Scenarios

  • Large websites
  • Services with high availability requirements