This is article 31 in the Big Data series: a deep dive into ZooKeeper's Leader election mechanism and the implementation principles of the ZAB (ZooKeeper Atomic Broadcast) protocol.
ZooKeeper Core Features Review
ZooKeeper provides the following distributed consistency guarantees:
- Sequential Consistency: updates from a given client are applied in the order they were sent
- Atomicity: an update either succeeds entirely or fails entirely
- Single System Image: every client sees the same view of the data, regardless of which server it connects to
- Reliability: once an update completes, it persists until overwritten by a later update
- Timeliness: a client's view of the system is guaranteed to be up to date within a bounded time
The data model is a tree of ZNodes, similar to a Unix filesystem. Each node stores up to 1 MB of data by default and is either persistent (PERSISTENT) or ephemeral (EPHEMERAL); ephemeral nodes are deleted automatically when the session that created them ends.
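The persistent/ephemeral distinction can be illustrated with a small in-memory sketch. This is a toy model for illustration only, not real ZooKeeper server code; the class and method names are invented here.

```python
# Toy in-memory model of ZNode semantics: the 1 MB default data limit,
# and ephemeral nodes vanishing when their owning session ends.

class ZNodeStore:
    def __init__(self):
        self.nodes = {}  # path -> (data, mode, owner_session)

    def create(self, path, data, mode="PERSISTENT", session=None):
        if len(data) > 1024 * 1024:            # 1 MB default data limit
            raise ValueError("data exceeds 1MB limit")
        self.nodes[path] = (data, mode, session)

    def close_session(self, session):
        # Ephemeral nodes are deleted automatically when their session ends.
        self.nodes = {p: v for p, v in self.nodes.items()
                      if not (v[1] == "EPHEMERAL" and v[2] == session)}

store = ZNodeStore()
store.create("/config/myapp", b"retries=3")                       # persistent
store.create("/services/worker-1", b"", "EPHEMERAL", session=42)  # ephemeral
store.close_session(42)        # /services/worker-1 disappears; /config/myapp stays
```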
Watcher Monitoring Mechanism
Clients can register a Watcher on a node to be notified when the node's data or its list of children changes. A Watcher is a one-shot trigger: after a notification fires, it must be re-registered to receive further events.
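The one-shot semantics are the part that most often surprises newcomers, so here is a minimal toy sketch of them (an invented class, not the ZooKeeper client API):

```python
# Toy illustration of one-shot Watcher semantics: a registered callback
# fires on the next change only, then must be re-registered.

class WatchedNode:
    def __init__(self, data):
        self.data = data
        self.watchers = []          # pending one-shot callbacks

    def watch(self, callback):
        self.watchers.append(callback)

    def set_data(self, data):
        self.data = data
        fired, self.watchers = self.watchers, []   # clear first: one-shot
        for cb in fired:
            cb(data)

events = []
node = WatchedNode(b"v1")
node.watch(lambda d: events.append(d))
node.set_data(b"v2")   # watcher fires once
node.set_data(b"v3")   # no notification: the watcher was consumed
```

A real client would re-register the watch inside the callback to keep observing the node.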
Typical Application Scenarios
| Scenario | Implementation |
|---|---|
| Service Registration & Discovery | A Dubbo Provider creates an ephemeral node under /dubbo/com.example.Service/providers |
| Configuration Management | The application watches the /config/myapp node and picks up configuration changes dynamically |
| Message Queue | Producers create sequential nodes; consumers watch for new nodes to appear |
| Distributed Lock | Multiple clients race to create the same ephemeral node; the one that succeeds holds the lock |
| Cluster Management | Ephemeral nodes represent live members and are deleted automatically when a node goes offline |
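The distributed-lock row can be sketched as logic: clients race to create one ephemeral node, the winner holds the lock, and the node's deletion (session end) releases it. This is an in-memory toy with invented names, not a real ZooKeeper client.

```python
# Toy sketch of the simple lock recipe: create-if-absent wins the lock;
# the ephemeral node (here, the holder slot) clears when the session ends.

class LockService:
    def __init__(self):
        self.holder = None   # session holding /locks/mylock, if any

    def try_lock(self, session):
        # Models create("/locks/mylock", EPHEMERAL): succeeds only if absent.
        if self.holder is None:
            self.holder = session
            return True
        return False

    def release(self, session):
        # Models the ephemeral node being deleted when the session ends.
        if self.holder == session:
            self.holder = None

svc = LockService()
assert svc.try_lock("client-A")        # A wins the race
assert not svc.try_lock("client-B")    # B fails and would set a watch
svc.release("client-A")                # A's session ends, node deleted
assert svc.try_lock("client-B")        # B retries and acquires the lock
```

Production recipes usually use ephemeral *sequential* nodes instead, so waiting clients each watch only their predecessor and avoid a thundering herd.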
Leader Election Mechanism
Core Principles
- Quorum Principle: the cluster can serve requests only while more than half of its nodes are alive
- Odd Number of Nodes: deploy 3, 5, or 7 nodes; odd sizes give the best fault tolerance per node
- Role Division: the cluster has exactly one Leader; all other voting members are Followers
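The quorum arithmetic behind these principles is simple enough to compute directly:

```python
# Quorum arithmetic: an N-node ensemble needs floor(N/2)+1 live nodes to
# serve, and therefore tolerates floor((N-1)/2) node failures.

def majority(n):
    return n // 2 + 1

def tolerance(n):
    return (n - 1) // 2

for n in (3, 4, 5, 6, 7):
    print(f"{n} nodes: quorum {majority(n)}, tolerates {tolerance(n)} failures")

# Note that 4 nodes tolerate only 1 failure, the same as 3 nodes, so even
# cluster sizes add hardware cost without adding fault tolerance.
```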
Initial Startup Election (5-Node Example)
Using a 5-node cluster as an example, the first-startup election proceeds as follows:
- Node 1 starts: it votes for itself but gets no responses, so it stays in the LOOKING state
- Node 2 starts: Nodes 1 and 2 exchange votes; both settle on Node 2 (ZXIDs are equal, so the larger server ID wins), but 2 votes falls short of the majority of 3 → both keep LOOKING
- Node 3 starts: all three nodes converge on Node 3, which has the largest ID; its 3 votes exceed the majority → Node 3 becomes LEADER
- Node 4 starts: it discovers an existing Leader → joins directly as a FOLLOWER
- Node 5 starts: same as above, joins as a FOLLOWER
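The startup sequence above can be replayed as a toy simulation. This is a deliberate simplification (it assumes all started nodes have already converged on the best vote, and all ZXIDs are 0 on a fresh cluster), not the actual FastLeaderElection implementation.

```python
# Toy replay of the 5-node startup election: LOOKING nodes converge on the
# largest (zxid, myid) vote; a Leader emerges once that vote has a quorum.

def run_startup(total, started):
    # On a fresh cluster every zxid is 0, so myid breaks the tie.
    best = max(started, key=lambda myid: (0, myid))
    votes = len(started)              # simplification: all started nodes agree
    quorum = total // 2 + 1
    return best if votes >= quorum else None   # None => still LOOKING

assert run_startup(5, [1]) is None        # Node 1 alone: LOOKING
assert run_startup(5, [1, 2]) is None     # 2 votes < quorum of 3: LOOKING
assert run_startup(5, [1, 2, 3]) == 3     # Node 3 reaches quorum: LEADER
```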
Non-Initial Election
When the Leader crashes during normal operation, a re-election is triggered. The node with the highest ZXID (transaction ID) is preferred as the new Leader, because the ZXID reflects how up to date a node's data is.
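The vote-preference rule boils down to tuple comparison: data freshness (ZXID) first, server ID only as a tiebreaker. A sketch of the comparison rule, not the actual election code:

```python
# Vote ordering after a Leader crash: the highest ZXID (freshest log) wins;
# myid only breaks exact ZXID ties.

def better_vote(a, b):
    """Each vote is a (zxid, myid) tuple; return the preferred one."""
    return a if a > b else b   # Python compares tuples element-wise

# Node 2 has applied transaction 0x105; node 5 has only seen 0x103:
assert better_vote((0x105, 2), (0x103, 5)) == (0x105, 2)  # freshness beats id
assert better_vote((0x105, 2), (0x105, 4)) == (0x105, 4)  # ZXID tie -> larger myid
```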
ZAB Protocol Detailed
Protocol Overview
ZAB (ZooKeeper Atomic Broadcast) is an atomic broadcast protocol designed specifically for ZooKeeper. Adapted and simplified from Paxos ideas, it guarantees:
- Messages are either atomically delivered to all nodes or to none
- Messages are strictly totally ordered across the cluster
- Delivery is reliable, so all replicas eventually converge
Master-Slave Architecture
All client write requests must go through the Leader, which is responsible for replicating updates to all Followers. Read requests can be served directly by any Follower.
Message Broadcast Three Phases
Phase 1: Proposal
The Leader assigns each write request a unique, monotonically increasing ZXID, wraps the request in a transaction proposal, and sends it to every Follower through a FIFO queue, preserving order.
Phase 2: ACK (Acknowledgment)
On receiving a proposal, each Follower first writes it to its local transaction log, then replies with an ACK. The Leader waits for ACKs from a majority of nodes (counting itself); a 5-node cluster, for example, needs at least 3 ACKs.
Phase 3: Commit
The Leader commits the transaction locally, then broadcasts a Commit message to all Followers, which apply the proposal to their local state machines on receipt.
Unlike classic two-phase commit, ZAB requires acknowledgment from only a majority rather than from every node, so a single slow or failed node cannot block the whole cluster.
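The three phases and the majority rule can be condensed into a toy decision function. This models only the quorum check, not the log writes, queues, or network:

```python
# Toy model of the Proposal / ACK / Commit flow: the Leader commits once
# floor(N/2)+1 ACKs arrive, its own implicit ACK included, so one slow or
# dead Follower cannot block the pipeline (unlike all-or-nothing 2PC).

def broadcast(cluster_size, follower_acks):
    zxid = 0x100                      # Leader assigns the next ZXID (phase 1)
    acks = 1 + follower_acks          # Leader counts itself (phase 2)
    quorum = cluster_size // 2 + 1
    if acks >= quorum:
        return ("COMMIT", zxid)       # broadcast Commit to Followers (phase 3)
    return ("PENDING", zxid)          # keep waiting for more ACKs

assert broadcast(5, follower_acks=2) == ("COMMIT", 0x100)   # 3 of 5 suffices
assert broadcast(5, follower_acks=1) == ("PENDING", 0x100)  # 2 of 5 must wait
```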
Leader Fault Recovery
Fault Impact
- Service is temporarily unavailable while re-election runs
- Clients receive a ConnectionLoss exception and need retry logic
- In-flight transactions may be interrupted
ZAB Recovery Guarantees
- Committed transactions: by comparing ZXIDs across nodes, the new Leader ensures that every transaction acknowledged by a majority is eventually applied on all nodes
- Uncommitted transactions: transactions acknowledged by only a minority (e.g., 2 of 5 nodes) are discarded by the new Leader, preserving consistency
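The two guarantees reduce to one rule applied over the crashed Leader's log: keep what a quorum acknowledged, truncate the rest. A toy model over (zxid, ack_count) pairs:

```python
# Toy sketch of the recovery rule: majority-ACKed transactions survive and
# are re-committed everywhere; minority-only proposals are truncated.

def recover(log, cluster_size):
    quorum = cluster_size // 2 + 1
    keep = [zxid for zxid, acks in log if acks >= quorum]
    drop = [zxid for zxid, acks in log if acks < quorum]
    return keep, drop

# 5-node cluster: 0x101 was ACKed by 3 nodes before the crash, 0x102 by only 2.
keep, drop = recover([(0x101, 3), (0x102, 2)], 5)
assert keep == [0x101]   # reached quorum: preserved on all nodes
assert drop == [0x102]   # minority-only: discarded by the new Leader
```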
Recovery Time: Typical Leader re-election takes 200~400 milliseconds.
Production Deployment Suggestions
- Node count: use an odd number of nodes, commonly 3 (tolerates 1 failure) or 5 (tolerates 2 failures)
- Fault tolerance: an N-node cluster tolerates (N-1)/2 node failures
- Write routing: all writes go through the Leader; distribute reads across Followers to improve throughput
- Client retry: business code must handle ConnectionLoss and implement idempotent retries
Summary
Through the ZAB protocol's Leader election and atomic broadcast mechanisms, ZooKeeper delivers enterprise-grade distributed coordination. The majority-quorum arbitration principle, combined with ZXID ordering, provides strong consistency alongside good fault tolerance, making ZooKeeper a mature choice for coordinating production-grade distributed systems.