Distributed Services: Heartbeat Detection and High Availa...

Distributed System Overview

A distributed system combines multiple ordinary-performance servers that work together. At relatively low hardware cost, it can achieve higher throughput, better performance, and higher availability than a single high-performance server.

Four Core Design Strategies

Node Health Check

- Heartbeat mechanism: Nodes regularly send heartbeat packets to the monitoring center
- Timeout detection: Set reasonable response timeout periods
- Probe checking: Probe key service ports
- Examples: ZooKeeper’s Session mechanism, Kubernetes’ Liveness Probe

High Availability Guarantee

- Redundant design: Multi-node deployment to avoid a single point of failure
- Failover: Primary-secondary switching mechanism (such as Redis Sentinel)
- Service degradation: Core and non-core service isolation

Fault Tolerance

- Retry mechanism: Automatic retry for temporary failures
- Circuit breaker: Prevent fault propagation (such as Hystrix)
- Data consistency: Distributed transaction processing (2PC, TCC)

Load Balancing

- Algorithm selection: Round-robin, weighted, least connections, etc.
- Dynamic adjustment: Real-time scheduling based on node load
- Multi-level load: DNS→LVS→Nginx→Service layer
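
To make the circuit breaker item under Fault Tolerance more concrete, here is a minimal sketch in Python. It is not how Hystrix is implemented; the class name `CircuitBreaker` and the parameters `failure_threshold`, `reset_timeout`, and `clock` are assumptions chosen for illustration. After a run of failures the breaker rejects calls for a while, then lets one trial call through.

```python
import time


class CircuitBreaker:
    """Illustrative breaker: closed -> open -> half-open -> closed."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: call rejected")
            # Reset timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()       # open (or re-open) the circuit
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Rejecting calls while the circuit is open is what keeps a failing dependency from tying up callers and spreading the fault upstream.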

Heartbeat Detection Mechanism Details

Heartbeat detection is a widely used technique in distributed systems for monitoring whether nodes are alive.

Core Implementation Methods

1. Active Push Mode

The monitored node (Client) regularly sends heartbeat packets to the monitoring node (Server)
Typical interval: 30 seconds to 2 minutes
Packet contents: nodeID, timestamp, load, and other information

2. Passive Request Mode

The monitoring node actively polls each monitored node
Polling uses GET/POST requests or a dedicated health check interface
The timeout is usually set to 2-3 times the heartbeat interval

Information Carried

Heartbeat packets usually contain:

Basic status: CPU usage, memory usage, disk space
Business metrics: Current connection count, request processing volume, queue length
Metadata: Node role, service version, configuration hash value
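
As an illustration of the fields listed above, a heartbeat packet could be modeled as follows. This is a sketch only: the field names, the JSON encoding, and the sample values are assumptions, and a production system might prefer a compact binary format.

```python
import json
import time
from dataclasses import dataclass, asdict


@dataclass
class Heartbeat:
    # Identity and time of sending
    node_id: str
    timestamp: float
    # Basic status
    cpu_percent: float
    memory_percent: float
    disk_free_gb: float
    # Business metrics
    connections: int
    requests_handled: int
    queue_length: int
    # Metadata
    role: str
    version: str
    config_hash: str

    def encode(self) -> bytes:
        """Serialize to compact JSON bytes for sending over the wire."""
        return json.dumps(asdict(self), separators=(",", ":")).encode("utf-8")

    @staticmethod
    def decode(raw: bytes) -> "Heartbeat":
        """Rebuild a Heartbeat from bytes produced by encode()."""
        return Heartbeat(**json.loads(raw.decode("utf-8")))


# Hypothetical sample values, for illustration only.
hb = Heartbeat("node-1", time.time(), 42.0, 63.5, 120.0, 87, 1520, 4,
               "worker", "1.4.2", "9f2c1a")
packet = hb.encode()
```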

Periodic Detection Heartbeat Mechanism

The Server runs a periodic detection task that sends a heartbeat request to every node in the cluster at a fixed interval of t seconds (for example, t = 5 seconds). Each heartbeat request carries a preset timeout threshold (for example, 3 seconds):

First timeout: Mark the node as “suspected failure”
Consecutive N timeouts (for example, N=3): Formally declare the node “dead”
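
The two rules above can be sketched as a small state tracker. The names `HeartbeatMonitor`, `run_round`, and the `probe` callable are hypothetical, and the sketch assumes that any successful response resets the node to alive.

```python
from enum import Enum


class NodeState(Enum):
    ALIVE = "alive"
    SUSPECTED = "suspected failure"
    DEAD = "dead"


class HeartbeatMonitor:
    def __init__(self, max_timeouts=3):
        self.max_timeouts = max_timeouts  # N consecutive timeouts => dead
        self.timeouts = {}                # node_id -> consecutive timeout count

    def record(self, node_id, responded):
        """Record one probe result and return the node's resulting state."""
        if responded:
            self.timeouts[node_id] = 0    # assumption: a reply clears suspicion
            return NodeState.ALIVE
        count = self.timeouts.get(node_id, 0) + 1
        self.timeouts[node_id] = count
        if count >= self.max_timeouts:
            return NodeState.DEAD
        return NodeState.SUSPECTED


def run_round(monitor, nodes, probe, timeout=3.0):
    """Probe every node once; probe(node, timeout) returns True on a timely reply."""
    return {node: monitor.record(node, probe(node, timeout)) for node in nodes}
```

A scheduler on the Server would call `run_round` once every t seconds, with `probe` wrapping whatever request the deployment actually uses.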

Accumulated Failure Detection Mechanism

Building on the basic heartbeat mechanism, the system keeps sliding-window statistics on the responses of each node:

Statistical metrics: Successful response count, timeout count, average response time
Calculate a node health score from this historical data
Mark the node with the corresponding status when its score falls below the threshold
Start an intensive detection process for “near-dead” nodes
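
One possible shape for such a sliding window is sketched below. The scoring rule (fraction of successful probes in the window), the window size, and the threshold are assumptions; the text does not prescribe a specific formula.

```python
from collections import deque


class SlidingWindowHealth:
    def __init__(self, window_size=10, threshold=0.6):
        # Each entry is (responded, latency_seconds); old entries drop off automatically.
        self.window = deque(maxlen=window_size)
        self.threshold = threshold

    def record(self, responded, latency=0.0):
        self.window.append((responded, latency))

    def stats(self):
        """Return success count, timeout count, and average latency of successes."""
        latencies = [lat for ok, lat in self.window if ok]
        return {
            "success": len(latencies),
            "timeout": len(self.window) - len(latencies),
            "avg_latency": sum(latencies) / len(latencies) if latencies else None,
        }

    def score(self):
        """Health score: fraction of probes in the window that succeeded."""
        if not self.window:
            return 1.0  # no data yet, assume healthy
        return sum(1 for ok, _ in self.window if ok) / len(self.window)

    def needs_intensive_probing(self):
        """True when the score has fallen below the threshold."""
        return self.score() < self.threshold
```

Compared with counting consecutive timeouts, a windowed score also catches nodes that fail intermittently without ever failing N times in a row.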

Challenges and Solutions

Network Partition Problem: Introduce a third-party arbitration node and use a lease mechanism to avoid split-brain
Resource Consumption Balance: Use a lightweight binary protocol and aggregate or compress heartbeats
Clock Drift Handling: Use NTP time synchronization and a logical clock compensation mechanism
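
The lease idea can be sketched as follows, assuming a single arbiter that hands out a time-bounded grant. The class `Lease` and its parameters are hypothetical. Because only one node can hold a valid lease at a time, two sides of a partition cannot both act as primary.

```python
import time


class Lease:
    """Time-bounded leadership grant kept by an arbitration node (illustrative)."""

    def __init__(self, duration=10.0, clock=time.monotonic):
        self.duration = duration  # seconds a grant stays valid
        self.clock = clock        # the arbiter's own clock, injectable for testing
        self.holder = None
        self.expires_at = 0.0

    def acquire(self, node_id):
        """Grant or renew the lease; refuse while another node's lease is still valid."""
        now = self.clock()
        if self.holder not in (None, node_id) and now < self.expires_at:
            return False
        self.holder = node_id
        self.expires_at = now + self.duration
        return True

    def is_leader(self, node_id):
        """True only while node_id holds an unexpired lease."""
        return self.holder == node_id and self.clock() < self.expires_at
```

The current holder has to renew before expiry by calling `acquire` again. Measuring expiry on the arbiter's monotonic clock, rather than comparing wall-clock timestamps from different machines, reduces the sketch's exposure to clock drift.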

High Availability Design

High Availability (HA) refers to using specific architectural designs and technical means to minimize system downtime.

Core HA Metrics

MTBF: Mean Time Between Failures, the average time the system operates normally between failures
MTTR: Mean Time To Repair, the average time needed to restore service after a failure
SLA: Service Level Agreement, which states the committed service availability percentage
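
These metrics are commonly related through the standard steady-state availability formula, treating MTBF as the mean uptime between failures as defined above. The numbers in the worked example are hypothetical.

```latex
A = \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTTR}}
\qquad\text{for example}\qquad
A = \frac{1000\ \mathrm{h}}{1000\ \mathrm{h} + 1\ \mathrm{h}} \approx 0.999 = 99.9\%
```

Over a year of 8760 hours, that availability corresponds to roughly 8.75 hours of downtime, which is why both lengthening MTBF and shortening MTTR matter when committing to an SLA.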

HA Models

1. Master-Slave Model

The primary-backup architecture is the most basic high-availability solution. It consists of one primary node (Master) and one or more backup nodes (Slave). The primary node handles all business requests, while the backup nodes stay on standby and synchronize data with the primary node at regular intervals. When the primary node fails, a backup node takes over the service through a failover mechanism.

Advantages: Simple implementation, low resource utilization requirements
Disadvantages: Backup nodes serve no requests during normal operation, so their resources sit idle
Application scenarios: Database primary-replica replication (such as MySQL primary-replica architecture), load balancer primary-backup deployment
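
The takeover step can be sketched in a few lines. This is a simplified illustration with hypothetical names (`PrimaryBackupGroup`, `failover`); it assumes the backups are already synchronized and ignores data that the failed primary had not yet replicated.

```python
class PrimaryBackupGroup:
    def __init__(self, primary, backups):
        self.primary = primary
        self.backups = list(backups)  # standby nodes, in promotion order

    def route(self):
        """All business requests go to the current primary."""
        return self.primary

    def failover(self):
        """Promote the first backup once the primary is declared dead."""
        if not self.backups:
            raise RuntimeError("no backup available for failover")
        failed = self.primary
        self.primary = self.backups.pop(0)
        return failed                 # returned so the caller can repair or re-add it


group = PrimaryBackupGroup("node-a", ["node-b", "node-c"])
group.failover()                      # node-a failed, node-b now serves requests
```

In practice `failover` would be triggered by the heartbeat detector once the primary is marked dead.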

2. Active-Active Model

In the active-active architecture, also known as dual-active mode, all nodes are active: they serve external requests simultaneously and back each other up. When a node fails, its load is automatically transferred to the other nodes.

Advantages: High resource utilization, no single point of failure
Disadvantages: High implementation complexity; data consistency between nodes must be handled
Application scenarios: Distributed database clusters in which every node accepts reads and writes (such as Apache Cassandra), multi-active data center deployment

3. Cluster Model

The cluster architecture consists of multiple nodes forming a logical whole, managed by distributed coordination services.

Key characteristics: Node equivalence, automatic failover, elastic scaling
Typical implementations: Kubernetes container orchestration clusters, Hadoop big data clusters, Redis Cluster

Fault Tolerance

Fault tolerance keeps a distributed system available and robust when individual components fail.

Cache Penetration Problem Solutions

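Cache penetration occurs when requests repeatedly ask for keys that do not exist, so every request misses the cache and falls through to the backing store. A simple remedy is to pre-set a placeholder value for non-existent keys, such as key = null. When this null value is returned, the application can decide whether to keep waiting and retry or to give up the operation.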
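
A minimal sketch of this null-value caching, assuming an in-process dictionary as the cache and a hypothetical `load_from_db` callable that returns None for missing keys. The short expiry on placeholders (`null_ttl`) is an added assumption so that keys created later become visible.

```python
import time

_MISSING = object()  # placeholder cached for keys the database does not have


class NullCachingStore:
    def __init__(self, load_from_db, null_ttl=60.0, clock=time.monotonic):
        self.load_from_db = load_from_db  # hypothetical loader: key -> value or None
        self.null_ttl = null_ttl          # seconds a "not found" placeholder lives
        self.clock = clock
        self.cache = {}                   # key -> (value, expires_at or None)

    def get(self, key):
        entry = self.cache.get(key)
        if entry is not None:
            value, expires_at = entry
            if expires_at is None or self.clock() < expires_at:
                return None if value is _MISSING else value
            del self.cache[key]           # placeholder expired, ask the database again
        value = self.load_from_db(key)
        if value is None:
            # Cache the miss so repeated lookups stop reaching the database.
            self.cache[key] = (_MISSING, self.clock() + self.null_ttl)
            return None
        self.cache[key] = (value, None)
        return value
```

With this in place, a burst of requests for the same non-existent key reaches the database once per `null_ttl` window instead of once per request.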

Load Balancing

The key idea of load balancing is to spread computing tasks across multiple servers in a cluster so that they share the work.

Load Balancing Strategies

Round-robin: Distribute client requests to the backend servers in turn
Least connections: Send each request to the backend with the fewest current connections
IP Hash: Ensure requests from the same IP are forwarded to the same backend node
Weight-based: Give high-spec servers a larger weight so that they receive more requests
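
The four strategies can be sketched as small selection functions. These are illustrative only and do not reflect how LVS, HAProxy, or Nginx implement them; all names here are hypothetical, and the weighted variant uses random selection proportional to weight as one of several possible approaches.

```python
import hashlib
import itertools
import random


class RoundRobin:
    def __init__(self, backends):
        self._cycle = itertools.cycle(backends)

    def pick(self):
        """Return backends in sequence, wrapping around at the end."""
        return next(self._cycle)


class LeastConnections:
    def __init__(self, backends):
        self.active = {b: 0 for b in backends}  # backend -> open connections

    def pick(self):
        """Choose the backend with the fewest open connections and count the new one."""
        backend = min(self.active, key=self.active.get)
        self.active[backend] += 1
        return backend

    def release(self, backend):
        """Call when a connection to the backend closes."""
        self.active[backend] -= 1


def ip_hash(backends, client_ip):
    """Map the same client IP to the same backend while the backend list is unchanged."""
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    return backends[int.from_bytes(digest[:4], "big") % len(backends)]


def weighted_pick(weights, rng=random):
    """Pick a backend with probability proportional to its weight."""
    backends = list(weights)
    return rng.choices(backends, weights=[weights[b] for b in backends], k=1)[0]
```

Note that the simple modulo in `ip_hash` remaps most clients whenever a backend is added or removed.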

Load Balancing Tools

Hardware solutions: F5
Software solutions: LVS, HAProxy, Nginx