HBase Overall Architecture: HMaster, HRegionServer and Da...

This is article 33 in the Big Data series. It systematically reviews HBase's overall architecture, covering the responsibilities of each core component, the data model, and typical application scenarios.


HBase Introduction

HBase is an open-source, distributed NoSQL database modeled on the design in Google's Bigtable paper. It uses a column-family storage model, runs on HDFS, and supports real-time random reads and writes over PB-level data.

Core differences from MySQL and other relational databases:

Comparison Dimension	MySQL	HBase
Storage Model	Row-oriented	Column-family
Data Scale	Hundreds of millions of rows	PB-level
Scaling	Mainly vertical scaling	Horizontal linear scaling
Transaction Support	Full ACID	Single-row ACID
Applicable Scenarios	Structured business data	Massive semi-structured data

Core Features

Massive Data Storage: Supports PB-level data that is automatically sharded across multiple nodes, with compression that can shrink stored data by more than 50% depending on the data and codec. Suitable for user logs, IoT sensor data, financial transaction records, and similar workloads.

High Availability and Horizontal Scaling: Master-slave architecture in which RegionServers can be added online. Storage capacity and compute power grow linearly in theory, and scaling requires no downtime.

Column-Family Storage: Data is physically stored by column family, so data in the same column family is kept together. Empty fields occupy no storage, which suits wide-table scenarios (for example, user profile tables with thousands of fields).

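Strong Consistency: Row-level ACID transactions, MVCC (multi-version concurrency control), and a WAL (write-ahead log) that ensures data durability. This suits scenarios that need strong consistency, such as financial transfers and inventory deduction, provided each operation is confined to a single row, since ACID guarantees do not span rows.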

Fast Random Read/Write: A multi-layer storage architecture made up of MemStore (in-memory write buffer), HFile (on-disk persistence), and BlockCache (read cache), combined with Bloom filters to speed up lookups. Typical response time is under 10 ms.
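To make the role of the Bloom filter in the read path concrete, here is a minimal Python sketch. It is a toy model, not HBase code: `BloomFilter`, `build_hfile`, and `read` are hypothetical names, HFiles are represented as plain dictionaries, and the BlockCache layer is left out for brevity. The point it illustrates is that a Bloom filter never gives a false negative, so a file whose filter rejects a row key can be skipped without reading it.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key: bytes):
        # Derive num_hashes bit positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: bytes):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


def build_hfile(cells: dict):
    """Hypothetical stand-in for an HFile: (bloom filter, row -> value)."""
    bloom = BloomFilter()
    for row in cells:
        bloom.add(row)
    return bloom, dict(cells)


def read(row: bytes, memstore: dict, hfiles: list):
    """Check the MemStore first, then HFiles from newest to oldest."""
    if row in memstore:
        return memstore[row]
    for bloom, cells in hfiles:
        if not bloom.might_contain(row):
            continue  # definitely absent: skip this file entirely
        if row in cells:
            return cells[row]
    return None


memstore = {b"user#9": b"fresh"}
hfiles = [build_hfile({b"user#1": b"v1", b"user#2": b"v2"})]
print(read(b"user#9", memstore, hfiles))  # served from the MemStore
print(read(b"user#1", memstore, hfiles))  # served from the HFile
```

In this sketch the filter is sized arbitrarily at 1024 bits with 3 hash functions; a real deployment would size it from the expected number of keys and the acceptable false-positive rate.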

Data Model

HBase uses four-dimensional coordinates to locate a data unit:

(RowKey, ColumnFamily:Column, Timestamp) → Value

Dimension	Description
Row Key	Primary key; rows are sorted in lexicographic (byte) order, which determines the physical distribution of data
Column Family	Defined in advance at table creation; the unit of physical storage, with data in the same column family stored in the same files
Column Qualifier	A specific column within a column family; can be added dynamically with no pre-definition
Timestamp	Version control; defaults to the write timestamp, and the number of retained versions is configurable

All data is stored as byte arrays; type conversion is handled by the application layer.
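The coordinate mapping above can be sketched as a nested dictionary. The following Python model is illustrative only: `TinyTable` and its methods are hypothetical names and do not correspond to the HBase client API. It shows three properties from the table: column families are fixed at creation, qualifiers are free-form, and each cell keeps a bounded number of timestamped versions.

```python
from collections import defaultdict


class TinyTable:
    """Toy in-memory model of HBase's logical data model (not the real API)."""

    def __init__(self, column_families, max_versions=1):
        self.column_families = set(column_families)  # fixed at table creation
        self.max_versions = max_versions
        # row key -> (family, qualifier) -> {timestamp: value}
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, family, qualifier, value, ts):
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family!r}")
        versions = self._rows[row][(family, qualifier)]
        versions[ts] = value
        # Retain only the newest max_versions timestamps.
        for old_ts in sorted(versions, reverse=True)[self.max_versions:]:
            del versions[old_ts]

    def get(self, row, family, qualifier, ts=None):
        """Return the newest value, or the newest value at or before ts."""
        versions = self._rows.get(row, {}).get((family, qualifier), {})
        candidates = [t for t in versions if ts is None or t <= ts]
        return versions[max(candidates)] if candidates else None

    def row_keys(self):
        """Row keys in byte-wise lexicographic order, as a scan would see them."""
        return sorted(self._rows)


table = TinyTable(column_families=[b"info"], max_versions=2)
table.put(b"row10", b"info", b"name", b"alice", ts=1)
table.put(b"row10", b"info", b"name", b"alicia", ts=2)
table.put(b"row2", b"info", b"age", b"30", ts=1)

print(table.get(b"row10", b"info", b"name"))        # b'alicia'
print(table.get(b"row10", b"info", b"name", ts=1))  # b'alice'
print(table.row_keys())                             # [b'row10', b'row2']
```

Note that `b"row10"` sorts before `b"row2"` because comparison is byte by byte. This is why numeric row keys are usually zero-padded or otherwise encoded so that byte order matches the intended order.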

Overall Architecture

HBase works with four core components:

ZooKeeper — Coordination Service

Ensures only one active HMaster through election mechanism
Persistently stores hbase:meta metadata table location information
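Monitors the status of each node through heartbeats and triggers failover when a node's session times out; the timeout is configurable through zookeeper.session.timeout (90 seconds by default in recent hbase-default.xml, though the ZooKeeper server's own limits can cap it lower)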

HMaster — Management Node

HMaster is responsible for cluster-level management operations and does not directly participate in data reads or writes:

Allocates Regions and balances load across nodes (the balancer runs every 5 minutes by default)
Maintains table structure information and column family configuration (schema changes)
Takes over the WAL logs of a failed RegionServer for data recovery
Coordinates DDL operations (create table, delete table, modify table structure)

HMaster supports active-standby high availability: ZooKeeper ensures that only one Master is active, while standby Masters stay ready to take over at any time.

HRegionServer — Data Node

RegionServer is the node that actually handles client read/write requests:

Hosts multiple Regions; a single RegionServer typically manages about 100 Regions
Handles client Get / Put / Scan / Delete requests
When a MemStore fills up (128 MB by default), triggers a Flush that writes a new HFile
Periodically runs Minor Compaction (merging small files) and Major Compaction (a full merge that also cleans up expired versions)
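The flush and compaction cycle described in the list above can be sketched as follows. This is a simplified Python model under stated assumptions: `Store` here is a hypothetical class, the flush threshold counts cells rather than bytes, and `compact` merges every file into one, which resembles a full merge but ignores delete markers and version limits.

```python
class Store:
    """Toy write path: in-memory buffer flushed to immutable sorted files."""

    def __init__(self, flush_threshold=2):
        # Assumption: threshold in number of cells; HBase measures bytes.
        self.flush_threshold = flush_threshold
        self.memstore = {}
        self.hfiles = []  # oldest first; each file is a sorted list of (row, value)

    def put(self, row, value):
        self.memstore[row] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the MemStore out as one new sorted file, then clear it."""
        if self.memstore:
            self.hfiles.append(sorted(self.memstore.items()))
            self.memstore = {}

    def compact(self):
        """Merge all files into one; newer files overwrite older values."""
        merged = {}
        for hfile in self.hfiles:  # oldest first, so later updates win
            merged.update(hfile)
        self.hfiles = [sorted(merged.items())] if merged else []

    def get(self, row):
        if row in self.memstore:
            return self.memstore[row]
        for hfile in reversed(self.hfiles):  # newest file first
            for key, value in hfile:
                if key == row:
                    return value
        return None


store = Store(flush_threshold=2)
store.put(b"a", b"1")
store.put(b"b", b"2")   # second cell reaches the threshold and triggers a flush
store.put(b"a", b"3")
store.put(b"c", b"4")   # triggers a second flush
print(len(store.hfiles))  # 2
store.compact()
print(store.hfiles)       # [[(b'a', b'3'), (b'b', b'2'), (b'c', b'4')]]
```

The sketch also shows why compaction matters for reads: before `compact()`, a lookup may have to consult every file, while afterwards there is a single sorted file to search.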

Region — Storage Unit

Region is the basic unit of HBase data sharding:

Each table is divided into multiple Regions by RowKey range, and each Region stores a contiguous RowKey interval
Each column family corresponds to an independent Store within a Region
Each Store contains:
- MemStore: Write buffer that keeps data ordered in a skip list; flushed at 128 MB by default
- StoreFile (HFile): File persisted on disk; contains index and Bloom filter blocks
When a Region's size reaches the threshold (10 GB by default), it automatically splits into two child Regions
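Range-based sharding and splitting can be illustrated with a sorted list of Region start keys. The Python sketch below is a minimal model, not how HBase stores this information: `RegionMap` is a hypothetical class, and the split key is supplied by the caller, whereas the real system chooses the split point itself once the size threshold is reached.

```python
import bisect


class RegionMap:
    """Toy routing table: each Region covers [start_key, next_start_key)."""

    def __init__(self):
        # A new table starts as one Region covering the whole key space.
        self.start_keys = [b""]

    def locate(self, row: bytes) -> int:
        """Index of the Region whose key range contains row."""
        return bisect.bisect_right(self.start_keys, row) - 1

    def split(self, split_key: bytes):
        """Split the Region containing split_key into two child Regions."""
        if split_key in self.start_keys:
            raise ValueError("split key is already a Region boundary")
        bisect.insort(self.start_keys, split_key)

    def key_range(self, index: int):
        """(start, end) of a Region; end is None for the last Region."""
        start = self.start_keys[index]
        has_next = index + 1 < len(self.start_keys)
        return start, (self.start_keys[index + 1] if has_next else None)


regions = RegionMap()
print(regions.locate(b"zebra"))  # 0: only one Region exists

regions.split(b"m")
regions.split(b"t")
print(regions.locate(b"apple"))  # 0: range [b"", b"m")
print(regions.locate(b"m"))      # 1: start keys are inclusive
print(regions.locate(b"zebra"))  # 2: range [b"t", end)
print(regions.key_range(1))      # (b'm', b't')
```

Because each Region owns a contiguous key interval, a lookup is a binary search over the start keys, and a split only inserts one new boundary without touching other Regions.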