This is article 31 in the Big Data series: a deep analysis of ZooKeeper's Leader election mechanism and the implementation principles of the ZAB (ZooKeeper Atomic Broadcast) protocol.


ZooKeeper Core Features Review

ZooKeeper provides distributed consistency guarantees:

  • Sequential Consistency: updates from a given client are applied in the order they were sent
  • Atomicity: an update either succeeds on all nodes or fails on all of them
  • Single System Image: a client sees the same view of the service regardless of which server it connects to
  • Reliability: once an update succeeds, its result persists until it is overwritten
  • Timeliness: a client is guaranteed to see an up-to-date view of the system within a bounded time

The data model is a tree of ZNodes, similar to a Unix filesystem. Each node stores up to 1 MB of data by default and is either persistent (PERSISTENT) or ephemeral (EPHEMERAL); ephemeral nodes are deleted automatically when the session that created them ends.
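The persistent/ephemeral distinction can be sketched with a minimal in-memory model (this is illustrative Python, not the real ZooKeeper client API; the `MiniZk` class and its method names are invented for the sketch):

```python
# Minimal sketch of the ZNode model: persistent nodes outlive sessions,
# ephemeral nodes vanish when their creating session closes.

class MiniZk:
    def __init__(self):
        self.nodes = {}  # path -> (data, mode, owner_session)

    def create(self, path, data, mode="PERSISTENT", session=None):
        if len(data) > 1 * 1024 * 1024:  # 1 MB default data limit per znode
            raise ValueError("znode data exceeds 1 MB limit")
        self.nodes[path] = (data, mode, session)

    def close_session(self, session):
        # Ephemeral nodes owned by the closing session are deleted automatically.
        self.nodes = {p: v for p, v in self.nodes.items()
                      if not (v[1] == "EPHEMERAL" and v[2] == session)}

zk = MiniZk()
zk.create("/config", b"v1")                                  # persistent
zk.create("/services/a", b"host1", "EPHEMERAL", session=42)  # ephemeral
zk.close_session(42)
print(sorted(zk.nodes))  # only the persistent node survives
```

This session-bound cleanup is what makes ephemeral nodes useful for service registration and liveness tracking, as the scenarios below show.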

Watcher Monitoring Mechanism

Clients can register a Watcher on a node and receive a notification when the node's data or its list of children changes. A Watcher is a one-time trigger: after the notification fires, the client must re-register it to receive further events.
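The one-shot semantics can be demonstrated with a small simulation (plain Python with invented names, not the ZooKeeper API):

```python
# Sketch of one-time-trigger Watcher semantics: a registered callback fires
# at most once; the client must re-register to keep receiving events.

class WatchedNode:
    def __init__(self, data=b""):
        self.data = data
        self._watchers = []

    def register_watcher(self, callback):
        self._watchers.append(callback)

    def set_data(self, data):
        self.data = data
        fired, self._watchers = self._watchers, []  # one-time trigger: clear first
        for cb in fired:
            cb(data)

events = []
node = WatchedNode()
node.register_watcher(lambda d: events.append(d))
node.set_data(b"v1")  # watcher fires
node.set_data(b"v2")  # no watcher registered any more -> no notification
print(events)         # [b'v1']
```

Note the gap this implies in real systems: the node may change between the notification and the re-registration, so clients typically re-read the data when they re-register.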

Typical Application Scenarios

  • Service registration & discovery: a Dubbo Provider creates an ephemeral node under /dubbo/com.example.Service/providers
  • Configuration management: applications watch the /config/myapp node and sense configuration changes dynamically
  • Message queue: producers create sequential nodes; consumers watch for new nodes appearing
  • Distributed lock: multiple clients race to create the same ephemeral node; the one that succeeds holds the lock
  • Cluster management: ephemeral nodes represent live members and are deleted automatically when a member goes offline

Leader Election Mechanism

Core Principles

  • Quorum principle: the cluster can serve requests only while more than half of its nodes are alive
  • Odd number of nodes: deploy 3, 5, or 7 nodes; an odd cluster size maximizes fault tolerance per node added
  • Role division: the cluster has exactly one Leader; all other voting members are Followers
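The quorum arithmetic behind these principles fits in a few lines (illustrative Python):

```python
def quorum_size(n):
    """Smallest majority of an n-node ensemble."""
    return n // 2 + 1

def max_failures(n):
    """Failures an n-node ensemble tolerates while a majority survives."""
    return (n - 1) // 2

for n in (3, 4, 5, 6, 7):
    print(n, quorum_size(n), max_failures(n))
# A 4-node cluster tolerates no more failures than a 3-node one (1),
# and 6 nodes no more than 5 (2) -- hence the odd-size recommendation.
```

Adding a node to an odd-sized ensemble raises the quorum without raising fault tolerance, which is exactly why even sizes are discouraged.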

Initial Startup Election (5-Node Example)

Using 5-node cluster as example to explain election process on first startup:

  1. Node 1 starts: it sends an election proposal but receives no responses, so it stays in the LOOKING state
  2. Node 2 starts: it exchanges votes with Node 1; Node 2 wins because it has the larger server id, but 2 votes are still short of a majority → both keep LOOKING
  3. Node 3 starts: the three nodes converge on Node 3, which has the largest id; its 3 votes reach the majority → Node 3 becomes LEADER
  4. Node 4 starts: it detects that a Leader already exists → it joins directly as a FOLLOWER
  5. Node 5 starts: same as above, joins as a FOLLOWER
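The five steps above can be replayed with a simplified simulation (illustrative Python; at first startup all ZXIDs are equal, so votes reduce to comparing server ids, which is a deliberate simplification of FastLeaderElection):

```python
def startup_election(boot_order, cluster_size=5):
    """Replay the first-startup election: nodes vote for the largest server
    id seen so far; a Leader emerges once that candidate holds a majority."""
    majority = cluster_size // 2 + 1
    states, leader, online = {}, None, []
    for sid in boot_order:
        online.append(sid)
        if leader is not None:
            states[sid] = "FOLLOWING"      # a Leader already exists: join it
            continue
        candidate = max(online)            # all ZXIDs equal: highest id wins
        if len(online) >= majority:        # candidate now holds a majority
            leader = candidate
            for s in online:
                states[s] = "LEADING" if s == leader else "FOLLOWING"
        else:
            for s in online:
                states[s] = "LOOKING"
    return states

print(startup_election([1, 2, 3, 4, 5]))
# Node 3 becomes LEADING the moment nodes 1-3 are up; 4 and 5 join as FOLLOWING.
```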

Non-Initial Election

When the Leader crashes during normal operation, the remaining nodes hold a re-election. The node with the highest ZXID (transaction id) is preferred as the new Leader, because a higher ZXID means more up-to-date data; if ZXIDs are equal, the node with the larger server id wins.
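A ZXID is a 64-bit value whose high 32 bits hold the election epoch and low 32 bits a per-epoch counter. The vote-comparison rule (epoch, then ZXID, then server id) can be sketched in the spirit of ZooKeeper's `totalOrderPredicate` (the function name here is illustrative):

```python
def vote_wins(new_vote, cur_vote):
    """Does new_vote beat cur_vote? Votes are (epoch, zxid, sid) tuples;
    Python compares them lexicographically, which matches the election
    ordering: epoch first, then ZXID, then server id as the tie-breaker."""
    return new_vote > cur_vote

# After a Leader crash, the survivor with the higher ZXID is preferred:
a = (1, (1 << 32) | 7, 1)  # epoch 1, counter 7
b = (1, (1 << 32) | 5, 2)  # epoch 1, counter 5 -> loses despite larger sid
print(vote_wins(a, b))     # True

# Only when ZXIDs tie does the larger server id decide:
print(vote_wins((1, 10, 3), (1, 10, 2)))  # True
```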

ZAB Protocol Detailed

Protocol Overview

ZAB (ZooKeeper Atomic Broadcast) is an atomic broadcast protocol designed specifically for ZooKeeper. Derived from Paxos ideas, it ensures that:

  • a message is either atomically delivered to all nodes or to none
  • messages are globally and strictly ordered
  • the cluster reliably converges to a consistent state

Master-Slave Architecture

All client write requests must go through the Leader, which is responsible for replicating updates to all Followers. Read requests can be served directly by any server.

Message Broadcast Three Phases

Phase 1: Proposal

The Leader assigns a unique, monotonically increasing ZXID to each write request, wraps the request in a transaction proposal, and sends it to every Follower through a per-Follower FIFO queue, preserving order.

Phase 2: ACK (Acknowledgment)

On receiving a proposal, each Follower first writes it to its local transaction log and then replies with an ACK. The Leader waits for ACKs from a majority of nodes (counting itself); for example, a 5-node cluster needs at least 3 ACKs.

Phase 3: Commit

The Leader commits the transaction locally and then broadcasts a Commit message to all Followers, which apply the proposal to their local state machines on receipt.

The difference from classic two-phase commit: ZAB needs confirmation only from a majority rather than from every participant, so a single slow or failed node cannot block the whole commit.
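The commit decision of the three phases reduces to counting ACKs against the majority (a minimal sketch in illustrative Python):

```python
def broadcast_outcome(cluster_size, follower_acks):
    """Sketch of the ZAB commit rule: the Leader commits once a majority has
    logged the proposal, where the Leader's own write counts as one ACK."""
    majority = cluster_size // 2 + 1
    acks = 1 + len(follower_acks)  # Leader implicitly ACKs its own proposal
    return "COMMIT" if acks >= majority else "WAIT"

print(broadcast_outcome(5, {"f1", "f2"}))  # 3 of 5 -> COMMIT
print(broadcast_outcome(5, {"f1"}))        # 2 of 5 -> WAIT
```

This is why two of five nodes being slow or down does not stall writes, while three of five being down stops the cluster entirely.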

Leader Fault Recovery

Fault Impact

  • Service is briefly interrupted while the re-election runs
  • Clients receive ConnectionLoss exceptions and need retry logic
  • In-flight transactions may be interrupted

ZAB Recovery Guarantees

  • Committed transactions: by comparing ZXIDs across nodes, the new Leader ensures that every transaction confirmed by a majority is eventually applied on all nodes
  • Uncommitted transactions: transactions confirmed only by a minority (e.g., 2 of 5 nodes) are discarded by the new Leader, preserving consistency
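Both guarantees amount to every Follower adopting the new Leader's committed history: missing transactions are copied over and uncommitted stragglers are truncated. A minimal sketch (illustrative Python; the function name is invented, and real recovery uses DIFF/TRUNC/SNAP sync modes over ZXID ranges):

```python
def sync_with_leader(leader_log, follower_log):
    """Recovery sketch: a Follower discards proposals the new Leader never
    committed and adopts the Leader's committed history, so all nodes
    converge on the same transaction log."""
    truncated = [z for z in follower_log if z not in leader_log]
    synced = list(leader_log)  # Follower ends up with the Leader's history
    return synced, truncated

leader = [1, 2, 3, 4]    # ZXIDs committed by a majority
follower = [1, 2, 3, 5]  # ZXID 5 was ACKed by a minority only
print(sync_with_leader(leader, follower))  # ([1, 2, 3, 4], [5])
```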

Recovery time: a typical Leader re-election takes about 200 to 400 milliseconds.

Production Deployment Suggestions

  • Node count: use an odd number of nodes; common choices are 3 (tolerates 1 failure) or 5 (tolerates 2 failures)
  • Fault tolerance: an N-node cluster tolerates (N-1)/2 node failures
  • Write routing: all writes go through the Leader; distribute reads across Followers to improve throughput
  • Client retries: business code must handle ConnectionLoss and implement idempotent retries

Summary

Through the ZAB protocol's Leader election and atomic broadcast mechanisms, ZooKeeper delivers enterprise-grade distributed coordination. Majority-quorum arbitration combined with ZXID ordering provides strong consistency together with good fault tolerance, making ZooKeeper a mature choice for coordinating production-grade distributed systems.