Distributed System Overview
A distributed system combines multiple commodity servers that work together, achieving higher throughput, better performance, and higher availability than a single high-end server at relatively low hardware cost.
Four Core Design Strategies
Node Health Check
- Heartbeat mechanism: Nodes regularly send heartbeat packets to the monitoring center
- Timeout detection: Set reasonable response timeout periods
- Probe checking: Probe key service ports
- Examples: ZooKeeper’s session mechanism, Kubernetes’ liveness probe
High Availability Guarantee
- Redundant design: Multi-node deployment to avoid single point of failure
- Failover: Primary-secondary switching mechanism (such as Redis Sentinel)
- Service degradation: Isolate core from non-core services so that non-core features can be shed under pressure
Fault Tolerance
- Retry mechanism: Automatic retry for temporary failures
- Circuit breaker: Prevent fault propagation (such as Hystrix)
- Data consistency: Distributed transaction processing (2PC, TCC)
Load Balancing
- Algorithm selection: Round-robin, weighted, least connections, etc.
- Dynamic adjustment: Real-time scheduling based on node load
- Multi-level load balancing: DNS→LVS→Nginx→Service layer
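The retry mechanism and circuit breaker listed under Fault Tolerance above can be sketched together as follows. This is a minimal illustration, not how Hystrix or any particular library implements them; the names CircuitBreaker, CircuitOpenError, and retry, as well as the thresholds and delays, are assumptions made for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # hypothetical default
        self.reset_timeout = reset_timeout          # seconds the circuit stays open
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open, call rejected")
            # Reset timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def retry(func, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Retry func on failure, doubling the delay after each failed attempt."""
    for attempt in range(attempts):
        try:
            return func()
        except CircuitOpenError:
            raise  # retrying against an open circuit would defeat its purpose
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

Retries suit temporary failures, while the breaker stops callers from piling requests onto a dependency that keeps failing; the two are usually combined so that retries stop once the circuit opens.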
Heartbeat Detection Mechanism Details
Heartbeat detection is a widely used technique in distributed systems for monitoring whether nodes are alive.
Core Implementation Methods
1. Active Push Mode
- Monitored node (Client) regularly sends heartbeat packets to monitoring node (Server)
- Typical interval: 30 seconds to 2 minutes
- Packet example: Contains nodeID, timestamp, load, and other information
2. Passive Request Mode
- Monitoring node actively polls each monitored node
- Uses HTTP GET/POST requests or a dedicated health check interface
- Timeout is usually set to 2-3 times the heartbeat interval
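For the active push mode, the monitoring side only needs to remember when each node last reported and compare that against a timeout. The sketch below assumes a 30-second interval and a timeout of three intervals, in line with the ranges above; the class and method names are hypothetical, and the caller supplies the current time so the logic stays independent of any particular clock.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat time of each node (push mode)."""

    def __init__(self, interval=30.0, timeout_factor=3):
        # A node is considered overdue after timeout_factor missed intervals.
        self.timeout = interval * timeout_factor
        self.last_seen = {}

    def on_heartbeat(self, node_id, now):
        """Record a heartbeat received from node_id at time now (seconds)."""
        self.last_seen[node_id] = now

    def expired_nodes(self, now):
        """Return the nodes whose last heartbeat is older than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        )
```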
Information Carried
Heartbeat packets usually contain:
- Basic status: CPU usage, memory usage, disk space
- Business metrics: Current connection count, request processing volume, queue length
- Metadata: Node role, service version, configuration hash value
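A heartbeat packet carrying these three groups of fields might be modeled as below. The field names and the JSON encoding are assumptions for illustration only; a production system that cares about overhead would more likely use a compact binary format.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Heartbeat:
    # Identity
    node_id: str
    timestamp: float
    # Basic status
    cpu_percent: float
    memory_percent: float
    disk_free_gb: float
    # Business metrics
    connections: int
    queue_length: int
    # Metadata
    role: str
    version: str
    config_hash: str

    def encode(self) -> bytes:
        """Serialize the packet to UTF-8 JSON bytes."""
        return json.dumps(asdict(self)).encode("utf-8")

    @staticmethod
    def decode(raw: bytes) -> "Heartbeat":
        """Rebuild a packet from bytes produced by encode()."""
        return Heartbeat(**json.loads(raw.decode("utf-8")))
```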
Periodic Detection Heartbeat Mechanism
The server runs a periodic task that sends a heartbeat request to every node in the cluster at a fixed interval of t seconds (for example, t = 5). Each request carries a preset timeout (for example, 3 seconds), and timeouts are escalated as follows:
- First timeout: Mark the node as “suspected failure” status
- Consecutive N timeouts (for example, N=3): Formally determine the node as “dead” status
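The two-step escalation above amounts to counting consecutive timeouts per node. A minimal sketch, assuming N = 3 and that any successful response resets the count (the source does not state the reset rule, so that part is an assumption, as are the names used):

```python
from enum import Enum


class NodeState(Enum):
    ALIVE = "alive"
    SUSPECTED = "suspected failure"
    DEAD = "dead"


class TimeoutCounter:
    def __init__(self, dead_after=3):
        self.dead_after = dead_after  # N consecutive timeouts mean "dead"
        self.misses = {}

    def record(self, node_id, responded):
        """Record one probe result and return the node's new state."""
        if responded:
            self.misses[node_id] = 0  # assumption: a response clears the count
        else:
            self.misses[node_id] = self.misses.get(node_id, 0) + 1
        return self.state(node_id)

    def state(self, node_id):
        misses = self.misses.get(node_id, 0)
        if misses >= self.dead_after:
            return NodeState.DEAD
        if misses >= 1:
            return NodeState.SUSPECTED
        return NodeState.ALIVE
```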
Accumulated Failure Detection Mechanism
Based on the basic heartbeat mechanism, the system maintains a sliding window statistical model to record the response status of each node:
- Statistical metrics: Successful response count, timeout count, average response time
- Calculate node health score based on historical data
- Mark the node with the corresponding status (for example, “near-dead”) when its score falls below the threshold
- Probe “near-dead” nodes more frequently to confirm their status quickly
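One way to picture the sliding window is a fixed-size queue of recent probe results per node. In the sketch below the health score is simply the success ratio within the window; the window size, the 0.6 threshold, and the scoring formula are all assumptions, since the text does not specify how the score is computed.

```python
from collections import deque


class HealthWindow:
    def __init__(self, size=10, threshold=0.6):
        # Each sample is (responded, latency_seconds); old samples fall out.
        self.samples = deque(maxlen=size)
        self.threshold = threshold

    def record(self, responded, latency=0.0):
        self.samples.append((responded, latency))

    def score(self):
        """Fraction of successful responses in the window (1.0 if empty)."""
        if not self.samples:
            return 1.0
        ok = sum(1 for responded, _ in self.samples if responded)
        return ok / len(self.samples)

    def average_latency(self):
        """Mean latency of successful responses, or None if there are none."""
        latencies = [lat for responded, lat in self.samples if responded]
        return sum(latencies) / len(latencies) if latencies else None

    def is_near_dead(self):
        return self.score() < self.threshold
```

Compared with counting consecutive timeouts, a windowed score also catches nodes that fail intermittently without ever missing several probes in a row.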
Challenges and Solutions
- Network Partition Problem: Introduce third-party arbitration node, use lease mechanism to avoid split-brain
- Resource Consumption Balance: Use lightweight binary protocol, heartbeat aggregation compression technology
- Clock Drift Handling: NTP time synchronization, logical clock compensation mechanism
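The lease mechanism mentioned for network partitions can be sketched as state held by the arbitration node: a node may act as leader only while it holds an unexpired lease, so two sides of a partition cannot both believe they lead. The Lease class below is hypothetical and deliberately ignores clock drift between nodes, which a real implementation would have to budget for.

```python
class Lease:
    """Leadership lease as tracked by a third-party arbitration node."""

    def __init__(self, duration):
        self.duration = duration  # lease length in seconds
        self.holder = None
        self.expires_at = 0.0

    def acquire(self, node_id, now):
        """Grant or renew the lease; refuse if another node still holds it."""
        free = self.holder is None or now >= self.expires_at
        if free or self.holder == node_id:
            self.holder = node_id
            self.expires_at = now + self.duration
            return True
        return False

    def is_leader(self, node_id, now):
        """A node leads only while it holds a lease that has not expired."""
        return self.holder == node_id and now < self.expires_at
```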
High Availability Design
High Availability (HA) refers to the use of specific architectural designs and technical measures to minimize system downtime.
Core HA Metrics
- MTBF: Mean Time Between Failures, the average time the system runs normally between two failures
- MTTR: Mean Time To Repair, the average time to restore service after a failure
- SLA: Service Level Agreement, which typically commits to an availability percentage
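MTBF and MTTR are commonly combined into a steady-state availability figure, MTBF / (MTBF + MTTR), which can then be compared with the percentage committed in an SLA. The helper names below are made up for the example, and the yearly downtime assumes a 365-day year.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability as a fraction between 0 and 1."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_hours_per_year(avail):
    """Expected downtime per 365-day year for a given availability."""
    return (1.0 - avail) * 365 * 24


# A system that fails on average every 999 hours and takes 1 hour to repair
# is available 99.9% of the time, or roughly 8.76 hours of downtime a year.
a = availability(mtbf_hours=999, mttr_hours=1)
print(f"{a:.3%} available, {downtime_hours_per_year(a):.2f} h downtime/year")
```

The formula shows the two levers for higher availability: failing less often (raising MTBF) or recovering faster (lowering MTTR).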
HA Models
1. Primary-Backup (Master-Slave) Model
The primary-backup architecture is the most basic high-availability solution, consisting of one primary node (Master) and one or more backup nodes (Slave). The primary node handles all business requests, while the backup nodes stand by and synchronize data from the primary at regular intervals. When the primary node fails, a backup node takes over the service through a failover mechanism.
- Advantages: Simple to implement and operate
- Disadvantages: Backup nodes serve no traffic during normal operation, so their resources sit idle
- Application scenarios: Database primary-replica replication (such as MySQL primary-replica architecture), load balancer primary-backup deployment
2. Active-Active Model
The active-active architecture, also known as dual-active mode, has all nodes in active state, serving external requests simultaneously and backing up each other. When a node fails, its load automatically transfers to other nodes.
- Advantages: High resource utilization, no single point of failure
- Disadvantages: High implementation complexity, need to consider data consistency issues
- Application scenarios: Distributed database clusters in which every node accepts reads and writes (such as Apache Cassandra), multi-active data center deployment
3. Cluster Model
The cluster architecture consists of multiple nodes forming a logical whole, managed by distributed coordination services.
- Key characteristics: Node equivalence, automatic failover, elastic scaling
- Typical implementations: Kubernetes container orchestration clusters, Hadoop big data clusters, Redis Cluster
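To make the failover step of the primary-backup model concrete, the sketch below promotes the first available backup when the primary is declared failed. It is a toy model with hypothetical names: it leaves out data synchronization and failure detection, which is where most of the real difficulty lies.

```python
class PrimaryBackupGroup:
    def __init__(self, primary, backups):
        self.primary = primary
        self.backups = list(backups)

    def route(self):
        """All requests go to the current primary."""
        return self.primary

    def fail_over(self):
        """Promote the first backup and return the node that was replaced."""
        if not self.backups:
            raise RuntimeError("no backup available to promote")
        failed = self.primary
        self.primary = self.backups.pop(0)
        return failed
```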
Fault Tolerance
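Fault tolerance is a system’s ability to keep providing service, possibly in degraded form, when some of its components fail; it underpins both the availability and the robustness of a distributed system.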
Cache Penetration Problem Solutions
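A common technique is to cache a placeholder for keys that do not exist in the backing store, for example key = null, usually with a short expiry. Repeated lookups for such a key are then answered from the cache instead of reaching the database every time. When the application receives this null value, it can decide whether to wait and retry later or to give up the operation.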
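A minimal sketch of this null-value caching, using an in-memory dict in place of a real cache. The class name, the load_from_db callback, and the 60-second expiry for null entries are assumptions for the example.

```python
_MISSING = object()  # sentinel cached for keys the database does not have


class NullCachingStore:
    def __init__(self, load_from_db, null_ttl=60.0):
        self.load_from_db = load_from_db  # hypothetical loader: key -> value or None
        self.null_ttl = null_ttl          # keep null entries short-lived
        self.cache = {}                   # key -> (value, expires_at or None)

    def get(self, key, now):
        entry = self.cache.get(key)
        if entry is not None:
            value, expires_at = entry
            if expires_at is None or now < expires_at:
                # Cache hit, including a hit on a cached "does not exist".
                return None if value is _MISSING else value
        value = self.load_from_db(key)
        if value is None:
            # Remember the miss so repeated lookups skip the database.
            self.cache[key] = (_MISSING, now + self.null_ttl)
        else:
            self.cache[key] = (value, None)
        return value
```

The short expiry on null entries matters: without it, a key created later in the database would keep being reported as missing.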
Load Balancing
The core idea is to spread requests across multiple servers in a cluster so that they share the computing load.
Load Balancing Strategies
- Round-robin: Distribute client requests to the backend servers in turn
- Least connections: Send each new request to the server with the fewest active connections
- IP Hash: Ensure requests from the same IP are forwarded to the same backend node
- Weight-based: Assign higher weights to higher-spec servers so that they receive a larger share of requests
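The four strategies can each be written in a few lines. The sketch below is illustrative only and does not reflect how LVS, HAProxy, or Nginx implement them; all names are hypothetical, and the weighted variant uses random selection rather than a deterministic weighted round-robin.

```python
import hashlib
import itertools
import random


class RoundRobin:
    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)


class LeastConnections:
    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def pick(self):
        # Ties go to the server listed first.
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        """Call when a connection to server is closed."""
        self.active[server] -= 1


def ip_hash(servers, client_ip):
    """Map a client IP to a server; stable while the server list is unchanged."""
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    return servers[int.from_bytes(digest[:4], "big") % len(servers)]


def weighted_pick(weights, rng=random):
    """Pick a server at random, in proportion to its weight."""
    servers = list(weights)
    return rng.choices(servers, weights=[weights[s] for s in servers], k=1)[0]
```

Note that this simple IP hash remaps most clients whenever the server list changes; consistent hashing is the usual remedy when that matters.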
Load Balancing Tools
- Hardware solutions: F5
- Software solutions: LVS, HAProxy, Nginx