This is article 33 in the Big Data series. It systematically reviews HBase's overall architecture, covering the responsibilities of its core components, its data model, and typical application scenarios.
HBase Introduction
HBase is an open-source, distributed NoSQL database modeled on Google's Bigtable paper. It uses a column-family storage model, runs on HDFS, and supports real-time random reads and writes over PB-scale data.
Core differences from MySQL and other relational databases:
| Comparison Dimension | MySQL | HBase |
|---|---|---|
| Storage Model | Row-oriented | Column-family |
| Data Scale | Hundreds of millions of rows | PB-level |
| Scaling | Mainly vertical scaling | Horizontal, near-linear scaling |
| Transaction Support | Full ACID | Single-row ACID |
| Applicable Scenarios | Structured business data | Massive semi-structured data |
Core Features
Massive Data Storage: Supports PB-scale data and automatically shards it across multiple nodes. Compression can reduce storage by more than 50%, which suits user logs, IoT sensor data, financial transaction records, and similar workloads.
High Availability and Horizontal Scaling: Uses a master-slave architecture. RegionServers can be added online, so storage capacity and compute power grow roughly linearly in theory, and scaling requires no downtime.
Column-Family Storage: Data is physically stored by column family, so data in the same column family is kept together. Empty fields occupy no storage, which suits wide-table scenarios such as user profile tables with thousands of fields.
Strong Consistency: Provides row-level ACID transactions and MVCC (multi-version concurrency control), with a WAL (write-ahead log) that ensures durability. This suits strong-consistency scenarios such as financial transfers and inventory deduction, provided each atomic operation is confined to a single row.
Fast Random Read/Write: Uses a multi-layer storage architecture of MemStore (in-memory write buffer), HFile (on-disk persistence), and BlockCache (read cache), combined with Bloom filters to accelerate lookups. Typical response time is under 10 ms.
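The layered read path described above can be illustrated with a minimal Python sketch. All names here (`ToyBloomFilter`, `ToyHFile`, `read`) are hypothetical and this is not HBase code: real HBase caches HFile blocks rather than individual keys and merges results across sources rather than stopping at the first hit. The sketch only aims to show why a Bloom filter lets a read skip files that cannot contain the key.

```python
import hashlib


class ToyBloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key: bytes):
        # Derive several bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: bytes):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: bytes) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class ToyHFile:
    """Stand-in for an immutable on-disk file with a Bloom filter attached."""

    def __init__(self, rows: dict):
        self.rows = dict(rows)
        self.bloom = ToyBloomFilter()
        for key in self.rows:
            self.bloom.add(key)
        self.disk_reads = 0  # counts simulated disk accesses

    def get(self, key: bytes):
        if not self.bloom.might_contain(key):
            return None  # definitely absent: skip the disk read
        self.disk_reads += 1
        return self.rows.get(key)


def read(key, memstore, block_cache, hfiles):
    """Simplified lookup order: MemStore, then BlockCache, then HFiles (newest first)."""
    if key in memstore:
        return memstore[key]
    if key in block_cache:
        return block_cache[key]
    for hfile in hfiles:
        value = hfile.get(key)
        if value is not None:
            block_cache[key] = value  # cache for later reads
            return value
    return None
```

In this toy, a second `read` of the same key is served from `block_cache` and never touches the files, which is the effect the read cache is meant to provide.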
Data Model
HBase locates a data unit (cell) by four coordinates: row key, column family, column qualifier, and timestamp:
(RowKey, ColumnFamily:Column, Timestamp) → Value
| Dimension | Description |
|---|---|
| Row Key | Primary key; rows are sorted lexicographically by key bytes, which determines the physical distribution of data |
| Column Family | Defined at table creation; the physical storage unit, with data in the same column family stored in the same files |
| Column Qualifier | A specific column within a column family; can be added dynamically without pre-definition |
| Timestamp | Version control; defaults to the write time, and the number of retained versions is configurable |
All data is stored as byte arrays; type conversion is handled by the application layer.
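To make the coordinate scheme concrete, here is a minimal in-memory sketch in Python. `ToyTable` and its methods are hypothetical names invented for illustration, not the HBase client API; the model assumes a fixed number of retained versions per cell and ignores deletes and TTLs.

```python
from collections import defaultdict


class ToyTable:
    """Toy model of (row key, family, qualifier, timestamp) -> value addressing."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        # (row, family, qualifier) -> list of (timestamp, value), newest first
        self.cells = defaultdict(list)

    def put(self, row: bytes, family: bytes, qualifier: bytes, ts: int, value: bytes):
        key = (row, family, qualifier)
        # A write with an existing timestamp replaces that version.
        versions = [v for v in self.cells[key] if v[0] != ts]
        versions.append((ts, value))
        versions.sort(key=lambda v: v[0], reverse=True)
        # Keep only the newest max_versions entries.
        self.cells[key] = versions[: self.max_versions]

    def get(self, row: bytes, family: bytes, qualifier: bytes, ts=None):
        """Return the newest value at or before ts (the latest if ts is None)."""
        for version_ts, value in self.cells.get((row, family, qualifier), []):
            if ts is None or version_ts <= ts:
                return value
        return None

    def row_keys(self):
        """Row keys in byte-lexicographic order, the order HBase sorts rows in."""
        return sorted({key[0] for key in self.cells})


table = ToyTable(max_versions=2)
table.put(b"row10", b"info", b"name", 1, b"alice")
table.put(b"row10", b"info", b"name", 2, b"alice-v2")
table.put(b"row2", b"info", b"name", 1, b"bob")
```

Note that `b"row10"` sorts before `b"row2"` because comparison is byte by byte, not numeric. This is why row keys with numeric parts are often zero-padded in practice.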
Overall Architecture
HBase is built around four core components:
ZooKeeper — Coordination Service
- Ensures that only one HMaster is active, through an election mechanism
- Persistently stores the location of the `hbase:meta` metadata table
- Monitors the status of each node through heartbeats; a session timeout (30 seconds here, governed by `zookeeper.session.timeout`) triggers failover
HMaster — Management Node
HMaster is responsible for cluster-level management operations and does not directly participate in data reads and writes:
- Allocates Regions and balances load across nodes (the balancer runs every 5 minutes by default)
- Maintains table structure information and column family configuration (schema changes)
- Takes over a failed RegionServer's WAL logs for data recovery
- Coordinates DDL operations (create table, delete table, modify table structure)
HMaster supports active-standby high availability: ZooKeeper ensures that only one Master is active, while standby Masters remain ready to take over.
HRegionServer — Data Node
RegionServer is the node that actually handles client read/write requests:
- Hosts multiple Regions; a single RegionServer typically manages about 100 Regions
- Handles client Get / Put / Scan / Delete requests
- Flushes a MemStore to a new HFile when it fills up (128 MB by default)
- Periodically runs Minor Compaction (merges small files) and Major Compaction (a full merge that cleans up expired versions)
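The flush and compaction cycle in the list above can be sketched as a small Python model. `ToyStore`, its byte accounting, and the `compaction_threshold` parameter are assumptions made for illustration. The compaction here simply merges every file into one, whereas real HBase selects a subset of files for a minor compaction and reserves cleanup of expired versions for a major compaction.

```python
class ToyStore:
    """Toy model of one Store: a MemStore that flushes to sorted, immutable files."""

    def __init__(self, flush_size_bytes=128 * 1024 * 1024, compaction_threshold=3):
        self.flush_size_bytes = flush_size_bytes
        self.compaction_threshold = compaction_threshold  # hypothetical knob
        self.memstore = {}
        self.memstore_bytes = 0
        self.hfiles = []  # each file is a sorted list of (key, value); oldest first

    def put(self, key: bytes, value: bytes):
        old = self.memstore.get(key)
        if old is not None:
            self.memstore_bytes -= len(key) + len(old)
        self.memstore[key] = value
        self.memstore_bytes += len(key) + len(value)
        if self.memstore_bytes >= self.flush_size_bytes:
            self.flush()

    def flush(self):
        """Write the MemStore out as a new sorted file and start a fresh buffer."""
        if not self.memstore:
            return
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore = {}
        self.memstore_bytes = 0
        if len(self.hfiles) >= self.compaction_threshold:
            self.compact()

    def compact(self):
        """Merge all files into one; values from newer files win."""
        merged = {}
        for hfile in self.hfiles:  # oldest first, so later updates overwrite
            merged.update(hfile)
        self.hfiles = [sorted(merged.items())]


# Tiny thresholds so the behaviour is visible with a handful of writes.
store = ToyStore(flush_size_bytes=10, compaction_threshold=3)
store.put(b"a", b"12345678")   # 9 bytes buffered, below the threshold
store.put(b"b", b"1")          # 11 bytes: first flush
store.put(b"a", b"123456789")  # 10 bytes: second flush
store.put(b"c", b"123456789")  # 10 bytes: third flush, then compaction
```

After the last write the three flushed files have been merged into one, and the newer value for `b"a"` has replaced the older one.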
Region — Storage Unit
Region is the basic unit of data sharding in HBase:
- Each table is divided into multiple Regions by RowKey range; each Region stores a contiguous RowKey interval
- Each column family corresponds to an independent Store within a Region
- A Store contains:
  - MemStore: write buffer that keeps data ordered using a skip list; flushed at 128 MB by default (`hbase.hregion.memstore.flush.size`)
  - StoreFile (HFile): file persisted on disk, containing Bloom filter and index blocks
- When a Region's size reaches the threshold (10 GB by default), it automatically splits into two child Regions
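Because each Region covers a contiguous RowKey interval, finding the Region for a row reduces to a binary search over sorted start keys. The Python sketch below illustrates this along with a split. `ToyRegionMap` and the server names are hypothetical, the split key is supplied by the caller rather than derived from file contents, and the new child Region is assumed to stay on its parent's server unless one is given.

```python
import bisect


class ToyRegionMap:
    """Toy model of Region boundaries: sorted start keys, one server per Region."""

    def __init__(self, server="rs1"):
        self.start_keys = [b""]   # b"" marks the start of the first Region
        self.servers = [server]

    def locate(self, row_key: bytes):
        """Return (index, start_key, server) of the Region containing row_key."""
        idx = bisect.bisect_right(self.start_keys, row_key) - 1
        return idx, self.start_keys[idx], self.servers[idx]

    def split(self, split_key: bytes, server=None):
        """Split the Region containing split_key into two child Regions."""
        idx, start_key, parent_server = self.locate(split_key)
        if start_key == split_key:
            raise ValueError("split key must fall strictly inside a Region")
        # The lower child keeps the old start key; the upper child starts at split_key.
        self.start_keys.insert(idx + 1, split_key)
        self.servers.insert(idx + 1, server or parent_server)


regions = ToyRegionMap()
regions.split(b"m")                 # [b"", b"m") and [b"m", end)
regions.split(b"t", server="rs2")   # upper child placed on another server
```

With these splits, `regions.locate(b"apple")` falls in the first Region, `regions.locate(b"m")` in the second (start keys are inclusive), and `regions.locate(b"zebra")` in the third, served by `rs2`.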