Scaling Triggers
- Capacity indicator: Disk usage exceeds 80% and is projected to reach capacity within 3 months
- Performance indicator: Query response time consistently exceeds the SLA threshold (e.g., >500ms)
- Concurrency indicator: Active connections consistently above 70% of the configured maximum
- Monitoring alerts: Recurring CPU/IO saturation alerts
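The trigger conditions above can be expressed as a simple automated check. This is a minimal sketch: the `Metrics` structure and the growth-projection logic are illustrative assumptions, not part of any specific monitoring tool.

```python
# Sketch of an automated scaling-trigger check based on the indicators above.
# The Metrics fields and the 3-month projection are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Metrics:
    disk_usage_pct: float        # current disk usage, 0-100
    monthly_growth_pct: float    # disk growth per month, in percentage points
    p95_query_ms: float          # 95th-percentile query latency
    active_conn_ratio: float     # active connections / max_connections

def scaling_triggers(m: Metrics) -> list[str]:
    """Return the list of trigger conditions that currently fire."""
    fired = []
    # Capacity: over 80% now AND projected to hit 100% within 3 months
    if m.disk_usage_pct > 80 and m.disk_usage_pct + 3 * m.monthly_growth_pct >= 100:
        fired.append("capacity")
    if m.p95_query_ms > 500:          # SLA threshold from the text
        fired.append("performance")
    if m.active_conn_ratio > 0.70:    # 70% of max connections
        fired.append("concurrency")
    return fired
```

In practice these thresholds would come from a monitoring system (e.g., Prometheus queries) rather than a hand-built struct; the point is that each trigger is a cheap, periodic comparison.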
Horizontal Scaling Implementation Steps
1. Evaluation and Planning Phase
- Conduct capacity assessment, calculate current data growth curve
- Determine scaling ratio (e.g., increase node count by 50%)
- Select scaling strategy: consistent hash scaling or range scaling
2. Data Migration Plan
Plan A: Online Migration (Recommended)
- Deploy new nodes and add to cluster
- Configure data synchronization mechanism (e.g., MySQL GTID replication)
- Migrate hot data in batches
- Switch traffic and verify
Plan B: Downtime Migration
- Stop write services
- Full backup of existing data
- Redistribute data to new and old nodes
- Restore services
3. Sharding Strategy Adjustment
- Refactor sharding key algorithm
- Update routing configuration (e.g., MyCat/ShardingSphere configuration)
- Verify data distribution balance across shards
4. Application Layer Adaptation
- Update data source configuration
- Adjust connection pool parameters
- Modify potentially affected SQL statements
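The sharding-strategy adjustment in step 3 often uses consistent hashing so that adding nodes remaps only a fraction of keys. A minimal ring sketch follows; the node names and virtual-node count are assumptions for illustration, not taken from MyCat or ShardingSphere.

```python
# Minimal consistent-hash ring illustrating the routing change in step 3.
# Node names and vnode count are illustrative assumptions.
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes, vnodes=100):
        self._ring = []  # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def route(self, sharding_key: str) -> str:
        idx = bisect.bisect(self._keys, self._hash(sharding_key)) % len(self._ring)
        return self._ring[idx][1]

# Scaling from 8 to 12 nodes: only keys landing on the new nodes' ring
# segments change their routing (expected roughly 4/12 of all keys).
old = HashRing([f"db{i}" for i in range(8)])
new = HashRing([f"db{i}" for i in range(12)])
moved = sum(1 for k in range(10_000) if old.route(str(k)) != new.route(str(k)))
```

With naive modulo sharding, the same 8-to-12 change would instead remap most keys, which is why consistent hashing is the usual choice for incremental (non-doubling) scale-outs.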
Common Challenges and Solutions
- Data skew issue:
- Case: An e-commerce platform’s user table sharded by ID hash caused some shards to have 3x more data than others
- Solution: Use composite sharding keys (e.g., ID + registration time)
- Cross-shard transactions:
- Introduce distributed transaction framework (e.g., Seata)
- Or adopt eventual consistency model
- Scaling cost control:
- Use hybrid deployment strategy (SSD+HDD hybrid storage)
- Implement hot-cold data separation
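The composite-sharding-key fix for data skew can be sketched as follows. The field names (`user_id`, `reg_month`) and the use of MD5 are hypothetical choices for illustration.

```python
# Sketch of a composite sharding key for the data-skew case above: hashing
# the user ID together with a second attribute (here, registration month)
# spreads hot ID ranges across more shards. Field names are hypothetical.
import hashlib

def shard_for(user_id: int, reg_month: str, num_shards: int) -> int:
    """Route by a composite key instead of user_id alone."""
    composite = f"{user_id}:{reg_month}"
    digest = hashlib.md5(composite.encode()).hexdigest()
    return int(digest, 16) % num_shards
```

The trade-off: queries that know only the `user_id` can no longer compute the target shard directly and must either carry the second attribute or fan out across shards, so the composite key should use attributes that are always available at query time.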
Best Practices Case Study
Scaling experience from a social platform when daily active users exceeded 10 million:
- Scaled from 8 shards to 12 shards
- Used online migration method, took 72 hours
- QPS degradation during migration was held within 15%
- After scaling, TP99 latency reduced from 800ms to 300ms
Downtime Scaling
Overview
Downtime scaling is a common approach in early database architecture evolution, suitable for scenarios where database size is relatively small and brief service interruptions are acceptable.
Detailed Implementation Steps
- Service announcement phase: Publish a maintenance notice 3-5 days before scaling
- Service stop phase: Close the load balancer traffic entry and stop all application service processes
- Data migration phase: Add new database instances and run migration scripts to apply the new sharding rules
- Configuration update phase: Update database connection pool configuration and adjust sharding routing logic
- Service recovery phase: Start database services first, then application services
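The data migration phase above can be sketched as a simple modulo re-routing pass, assuming writes are already stopped. The row format and shard counts are illustrative assumptions.

```python
# Sketch of the redistribution step in a downtime migration: with writes
# stopped, each row is re-routed from the old modulo rule to the new one.
# Row format and shard counts are illustrative assumptions.

def redistribute(rows, old_shards: int, new_shards: int):
    """Return (row_id, src_shard, dst_shard) for every row that must move."""
    moves = []
    for row_id, _payload in rows:
        src = row_id % old_shards
        dst = row_id % new_shards
        if src != dst:
            moves.append((row_id, src, dst))
    return moves

# Example: going from 2 to 4 shards with ids 0..7,
# ids 2, 3, 6, 7 change shards while ids 0, 1, 4, 5 stay put.
rows = [(i, f"row-{i}") for i in range(8)]
```

Because the rule change touches a fixed fraction of all rows, migration time grows linearly with data volume, which is exactly the limitation noted below.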
Pros and Cons
Advantages:
- Simple and direct implementation, low technical difficulty
- No complex data synchronization mechanism needed
- Architecture adjustment is completed in a single pass
Limitations:
- Requires downtime, interrupting business continuity
- Migration time grows significantly with data volume
- Unsuitable for businesses requiring 24/7 high availability
Applicable Scenarios
- Early database expansion for startups
- Internal management system upgrades
- ToB services that can accept scheduled maintenance
- Migrations under TB-level data volume
Smooth Scaling
Overview
The core idea of smooth scaling is to adopt a gradual doubling strategy, incrementally increasing the number of databases through staged operations while keeping services uninterrupted.
Plan Characteristics
- Scaling ratio: Adopt 2x scaling strategy (e.g., expand from 2 DB nodes to 4 nodes)
- Technical requirements: Rely on dual-master replication mechanism for data synchronization
- Implementation phases: Divided into two main phases: test verification and production deployment
Detailed Implementation Steps
- Infrastructure preparation: Provision new nodes, ensuring they are in the same VPC as the existing cluster
- Test environment verification: Build test cluster, verify data consistency
- Production environment deployment: Rolling deployment, configure middleware to gradually migrate traffic
- Ongoing maintenance: Deploy monitoring, perform regular maintenance
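The appeal of the doubling strategy can be illustrated with modulo sharding: when the shard count doubles, every row either stays on its current node or moves to exactly one new node, so each old node needs only a single dual-master synchronization target. The shard counts here are illustrative.

```python
# Sketch of why 2x doubling keeps migration simple under modulo sharding:
# doubling the shard count (id % N -> id % 2N) means each row either stays
# put or moves to exactly one new node. Shard counts are illustrative.

def new_shard(row_id: int, old_count: int) -> int:
    return row_id % (2 * old_count)

def stays_put(row_id: int, old_count: int) -> bool:
    # A row stays iff its new shard equals its old shard.
    return new_shard(row_id, old_count) == row_id % old_count

# Going from 2 to 4 nodes: rows on node 0 split between nodes 0 and 2,
# rows on node 1 split between nodes 1 and 3. Exactly half the rows stay,
# and each old node has exactly one migration target.
stay = sum(stays_put(i, 2) for i in range(1000))
```

By contrast, a non-doubling change (e.g., 2 to 3 nodes) scatters each old node's rows across several targets, which is why the gradual-doubling approach pairs so naturally with the dual-master replication links described above.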
Advantages
- Business continuity: Services remain available throughout the scaling process
- Reduced team pressure: A more relaxed time window allows phased execution
- Risk control: Real-time monitoring lets issues be handled immediately
- Performance gains: Reducing per-database data volume yields significant performance improvement
Disadvantages
- High implementation complexity: Requires configuring dual-master replication and dual-master dual-write
- High scaling cost: Operations cost rises sharply, and the number of synchronization links grows quadratically with node count
Applicable Scenarios
- Large websites
- Services with high availability requirements