This is article 31 in the Big Data series: a deep analysis of ZooKeeper's Leader election mechanism and the implementation principles of the ZAB (ZooKeeper Atomic Broadcast) protocol.


ZooKeeper Core Features Review

ZooKeeper provides distributed consistency guarantees:

  • Sequential Consistency: updates from a given client are applied in the order they were sent
  • Atomicity: an update either succeeds on all nodes or fails on all of them
  • Single System Image: a client sees the same view of the service regardless of which server it connects to
  • Reliability: once an update succeeds, its result persists until it is overwritten
  • Timeliness: a client is guaranteed to see an up-to-date view of the system within a bounded time

The data model is a tree of ZNodes, similar to a Unix filesystem. Each node stores up to 1 MB of data by default and is either persistent (PERSISTENT) or ephemeral (EPHEMERAL); ephemeral nodes are deleted automatically when the session that created them ends.
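The persistent/ephemeral distinction can be sketched with a minimal in-memory model (this is illustrative Python, not the real ZooKeeper client API; the `MiniZk` class and its method names are invented for the sketch):

```python
# Minimal sketch of the ZNode model: persistent nodes outlive sessions,
# ephemeral nodes vanish when their creating session closes.

class MiniZk:
    def __init__(self):
        self.nodes = {}  # path -> (data, mode, owner_session)

    def create(self, path, data, mode="PERSISTENT", session=None):
        if len(data) > 1 * 1024 * 1024:  # 1 MB default data limit per znode
            raise ValueError("znode data exceeds 1 MB limit")
        self.nodes[path] = (data, mode, session)

    def close_session(self, session):
        # Ephemeral nodes owned by the closing session are deleted automatically.
        self.nodes = {p: v for p, v in self.nodes.items()
                      if not (v[1] == "EPHEMERAL" and v[2] == session)}

zk = MiniZk()
zk.create("/config", b"v1")                                  # persistent
zk.create("/services/a", b"host1", "EPHEMERAL", session=42)  # ephemeral
zk.close_session(42)
print(sorted(zk.nodes))  # only the persistent node survives
```

This session-bound cleanup is what makes ephemeral nodes useful for service registration and liveness tracking, as the scenarios below show.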

Watcher Monitoring Mechanism

Clients can register a Watcher on a node and receive a notification when the node's data or its list of children changes. A Watcher is a one-time trigger: after the notification fires, the client must re-register it to receive further events.
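The one-shot semantics can be demonstrated with a small simulation (plain Python with invented names, not the ZooKeeper API):

```python
# Sketch of one-time-trigger Watcher semantics: a registered callback fires
# at most once; the client must re-register to keep receiving events.

class WatchedNode:
    def __init__(self, data=b""):
        self.data = data
        self._watchers = []

    def register_watcher(self, callback):
        self._watchers.append(callback)

    def set_data(self, data):
        self.data = data
        fired, self._watchers = self._watchers, []  # one-time trigger: clear first
        for cb in fired:
            cb(data)

events = []
node = WatchedNode()
node.register_watcher(lambda d: events.append(d))
node.set_data(b"v1")  # watcher fires
node.set_data(b"v2")  # no watcher registered any more -> no notification
print(events)         # [b'v1']
```

Note the gap this implies in real systems: the node may change between the notification and the re-registration, so clients typically re-read the data when they re-register.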

Typical Application Scenarios

  • Service registration & discovery: a Dubbo Provider creates an ephemeral node under /dubbo/com.example.Service/providers
  • Configuration management: applications watch the /config/myapp node and sense configuration changes dynamically
  • Message queue: producers create sequential nodes; consumers watch for new nodes appearing
  • Distributed lock: multiple clients race to create the same ephemeral node; the one that succeeds holds the lock
  • Cluster management: ephemeral nodes represent live members and are deleted automatically when a member goes offline

Leader Election Mechanism

Core Principles

  • Quorum principle: the cluster can serve requests only while more than half of its nodes are alive
  • Odd number of nodes: deploy 3, 5, or 7 nodes; an odd cluster size maximizes fault tolerance per node added
  • Role division: the cluster has exactly one Leader; all other voting members are Followers
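The quorum arithmetic behind these principles fits in a few lines (illustrative Python):

```python
def quorum_size(n):
    """Smallest majority of an n-node ensemble."""
    return n // 2 + 1

def max_failures(n):
    """Failures an n-node ensemble tolerates while a majority survives."""
    return (n - 1) // 2

for n in (3, 4, 5, 6, 7):
    print(n, quorum_size(n), max_failures(n))
# A 4-node cluster tolerates no more failures than a 3-node one (1),
# and 6 nodes no more than 5 (2) -- hence the odd-size recommendation.
```

Adding a node to an odd-sized ensemble raises the quorum without raising fault tolerance, which is exactly why even sizes are discouraged.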

Initial Startup Election (5-Node Example)

Using 5-node cluster as example to explain election process on first startup:

  1. Node 1 starts: it sends an election proposal but receives no responses, so it stays in the LOOKING state
  2. Node 2 starts: it exchanges votes with Node 1; Node 2 wins because it has the larger server id, but 2 votes are still short of a majority → both keep LOOKING
  3. Node 3 starts: the three nodes converge on Node 3, which has the largest id; its 3 votes reach the majority → Node 3 becomes LEADER
  4. Node 4 starts: it detects that a Leader already exists → it joins directly as a FOLLOWER
  5. Node 5 starts: same as above, joins as a FOLLOWER
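The five steps above can be replayed with a simplified simulation (illustrative Python; at first startup all ZXIDs are equal, so votes reduce to comparing server ids, which is a deliberate simplification of FastLeaderElection):

```python
def startup_election(boot_order, cluster_size=5):
    """Replay the first-startup election: nodes vote for the largest server
    id seen so far; a Leader emerges once that candidate holds a majority."""
    majority = cluster_size // 2 + 1
    states, leader, online = {}, None, []
    for sid in boot_order:
        online.append(sid)
        if leader is not None:
            states[sid] = "FOLLOWING"      # a Leader already exists: join it
            continue
        candidate = max(online)            # all ZXIDs equal: highest id wins
        if len(online) >= majority:        # candidate now holds a majority
            leader = candidate
            for s in online:
                states[s] = "LEADING" if s == leader else "FOLLOWING"
        else:
            for s in online:
                states[s] = "LOOKING"
    return states

print(startup_election([1, 2, 3, 4, 5]))
# Node 3 becomes LEADING the moment nodes 1-3 are up; 4 and 5 join as FOLLOWING.
```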

Non-Initial Election

When the Leader crashes during normal operation, the remaining nodes hold a re-election. The node with the highest ZXID (transaction id) is preferred as the new Leader, because a higher ZXID means more up-to-date data; if ZXIDs are equal, the node with the larger server id wins.
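A ZXID is a 64-bit value whose high 32 bits hold the election epoch and low 32 bits a per-epoch counter. The vote-comparison rule (epoch, then ZXID, then server id) can be sketched in the spirit of ZooKeeper's `totalOrderPredicate` (the function name here is illustrative):

```python
def vote_wins(new_vote, cur_vote):
    """Does new_vote beat cur_vote? Votes are (epoch, zxid, sid) tuples;
    Python compares them lexicographically, which matches the election
    ordering: epoch first, then ZXID, then server id as the tie-breaker."""
    return new_vote > cur_vote

# After a Leader crash, the survivor with the higher ZXID is preferred:
a = (1, (1 << 32) | 7, 1)  # epoch 1, counter 7
b = (1, (1 << 32) | 5, 2)  # epoch 1, counter 5 -> loses despite larger sid
print(vote_wins(a, b))     # True

# Only when ZXIDs tie does the larger server id decide:
print(vote_wins((1, 10, 3), (1, 10, 2)))  # True
```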

ZAB Protocol Detailed

Protocol Overview

ZAB (ZooKeeper Atomic Broadcast) is an atomic broadcast protocol designed specifically for ZooKeeper. Derived from Paxos ideas, it ensures that:

  • a message is either atomically delivered to all nodes or to none
  • messages are globally and strictly ordered
  • the cluster reliably converges to a consistent state

Master-Slave Architecture

All client write requests must go through the Leader, which is responsible for replicating updates to all Followers. Read requests can be served directly by any server.

Message Broadcast Three Phases

Phase 1: Proposal

The Leader assigns a unique, monotonically increasing ZXID to each write request, wraps the request in a transaction proposal, and sends it to every Follower through a per-Follower FIFO queue, preserving order.

Phase 2: ACK (Acknowledgment)

On receiving a proposal, each Follower first writes it to its local transaction log and then replies with an ACK. The Leader waits for ACKs from a majority of nodes (counting itself); for example, a 5-node cluster needs at least 3 ACKs.

Phase 3: Commit

The Leader commits the transaction locally and then broadcasts a Commit message to all Followers, which apply the proposal to their local state machines on receipt.

The difference from classic two-phase commit: ZAB needs confirmation only from a majority rather than from every participant, so a single slow or failed node cannot block the whole commit.
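The commit decision of the three phases reduces to counting ACKs against the majority (a minimal sketch in illustrative Python):

```python
def broadcast_outcome(cluster_size, follower_acks):
    """Sketch of the ZAB commit rule: the Leader commits once a majority has
    logged the proposal, where the Leader's own write counts as one ACK."""
    majority = cluster_size // 2 + 1
    acks = 1 + len(follower_acks)  # Leader implicitly ACKs its own proposal
    return "COMMIT" if acks >= majority else "WAIT"

print(broadcast_outcome(5, {"f1", "f2"}))  # 3 of 5 -> COMMIT
print(broadcast_outcome(5, {"f1"}))        # 2 of 5 -> WAIT
```

This is why two of five nodes being slow or down does not stall writes, while three of five being down stops the cluster entirely.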

Leader Fault Recovery

Fault Impact

  • Service is briefly interrupted while the re-election runs
  • Clients receive ConnectionLoss exceptions and need retry logic
  • In-flight transactions may be interrupted

ZAB Recovery Guarantees

  • Committed transactions: by comparing ZXIDs across nodes, the new Leader ensures that every transaction confirmed by a majority is eventually applied on all nodes
  • Uncommitted transactions: transactions confirmed only by a minority (e.g., 2 of 5 nodes) are discarded by the new Leader, preserving consistency
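Both guarantees amount to every Follower adopting the new Leader's committed history: missing transactions are copied over and uncommitted stragglers are truncated. A minimal sketch (illustrative Python; the function name is invented, and real recovery uses DIFF/TRUNC/SNAP sync modes over ZXID ranges):

```python
def sync_with_leader(leader_log, follower_log):
    """Recovery sketch: a Follower discards proposals the new Leader never
    committed and adopts the Leader's committed history, so all nodes
    converge on the same transaction log."""
    truncated = [z for z in follower_log if z not in leader_log]
    synced = list(leader_log)  # Follower ends up with the Leader's history
    return synced, truncated

leader = [1, 2, 3, 4]    # ZXIDs committed by a majority
follower = [1, 2, 3, 5]  # ZXID 5 was ACKed by a minority only
print(sync_with_leader(leader, follower))  # ([1, 2, 3, 4], [5])
```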

Recovery time: a typical Leader re-election takes about 200 to 400 milliseconds.

Production Deployment Suggestions

  • Node count: use an odd number of nodes; common choices are 3 (tolerates 1 failure) or 5 (tolerates 2 failures)
  • Fault tolerance: an N-node cluster tolerates (N-1)/2 node failures
  • Write routing: all writes go through the Leader; distribute reads across Followers to improve throughput
  • Client retries: business code must handle ConnectionLoss and implement idempotent retries

Summary

Through the ZAB protocol's Leader election and atomic broadcast mechanisms, ZooKeeper delivers enterprise-grade distributed coordination. Majority-quorum arbitration combined with ZXID ordering provides strong consistency together with good fault tolerance, making ZooKeeper a mature choice for coordinating production-grade distributed systems.