TL;DR
- Scenario: a real-time/near-real-time data warehouse that needs low-latency writes plus fast parallel columnar scans for analytics.
- Conclusion: Kudu balances throughput and consistency through its RowSet storage design, Range/Hash composite partitioning, and Raft-replicated tablets.
- Output: a comparison of architectural key points, plus a quick reference of common errors and their fix paths.
Kudu Comparison
Comparison between Kudu, HBase, and HDFS:
| Feature | HDFS | HBase | Kudu |
|---|---|---|---|
| Random R/W | Poor | Strong | Strong |
| Batch Analysis | Strong | Poor | Strong |
| Latency | High | Low | Low |
Kudu Architecture
Similar to HDFS and HBase, Kudu uses a single Master node to manage cluster metadata and any number of TabletServer nodes to store the actual data. Multiple Master nodes can be deployed for fault tolerance.
Master
Kudu’s Master node is responsible for cluster-wide metadata management and service coordination, with the following functions:
- Catalog Manager: Manages all Tablets, schemas, and other metadata in the cluster.
- Cluster Coordination: Tracks the liveness of every TabletServer node and coordinates data redistribution when a node fails.
- Tablet Directory: Tracks the location of each Tablet.
Catalog Manager
The Master node holds a single tablet, the CatalogTable, which stores every table's schema version and table state (creating, running, deleting, etc.).
Cluster Coordination
Each TabletServer in a Kudu cluster is configured with the list of Master hostnames. On startup, a TabletServer registers with the Master and reports all of its Tablets.
Tablet Directory
Because the Master caches cluster metadata, a client obtains a Tablet's location from the Master before reading or writing. The client then caches that location locally, avoiding a round trip to the Master on every request.
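A minimal sketch of this client-side location caching, assuming a simple TTL-based cache; the class and method names are illustrative, not the real Kudu client API:

```python
import time

# Hypothetical sketch of a client-side tablet-location cache with a TTL.
# TabletLocationCache and its methods are illustrative names, not the
# real Kudu client API.
class TabletLocationCache:
    def __init__(self, ttl_seconds=300.0):
        self.ttl = ttl_seconds
        self._cache = {}        # tablet_id -> (server_address, expiry_time)
        self.master_lookups = 0

    def _ask_master(self, tablet_id):
        # Stand-in for an RPC to the Master's tablet directory.
        self.master_lookups += 1
        return "tserver-%d:7050" % (hash(tablet_id) % 3)

    def lookup(self, tablet_id, now=None):
        now = time.monotonic() if now is None else now
        hit = self._cache.get(tablet_id)
        if hit is not None and hit[1] > now:
            return hit[0]       # cache hit: no Master round trip
        addr = self._ask_master(tablet_id)
        self._cache[tablet_id] = (addr, now + self.ttl)
        return addr
```

Within the TTL window, repeated lookups for the same Tablet hit the local cache and never touch the Master, which is what keeps metadata traffic off the hot path.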
Table
Kudu implements its data storage layer entirely on its own rather than reusing an existing engine. The main goals of Tablet storage are:
- Fast column scan
- Low-latency random read/write
- Consistent performance
RowSets
In Apache Kudu’s storage architecture, Tablets are further divided into smaller storage units called RowSets. RowSets come in two types, based on storage medium:
- MemRowSets:
  - Fully in-memory data structure
  - B-tree based implementation
  - Stores the most recently inserted data
  - Uses an efficient concurrency-control mechanism supporting high-throughput writes
- DiskRowSets:
  - Hybrid memory-and-disk storage structure
  - Produced from a MemRowSet by a Flush operation
  - Stored on disk in a columnar format (CFile)
  - Includes auxiliary structures such as a Bloom Filter and a primary key index
Storage Management:
- Data uniqueness: Any row that has not been logically deleted exists in exactly one RowSet.
- MemRowSet lifecycle: Each Tablet maintains only one active MemRowSet at a time. A background Flush thread persists it based on a configured time interval (default 1 second) or size threshold.
Flush Operation Key Features:
- Non-blocking design: Copy-on-Write ensures the Flush does not block client reads or writes while it runs.
- Parallel processing: A large MemRowSet is split into multiple DiskRowSets to improve parallelism.
- Data transformation: Row-oriented data is converted to columnar format, and the corresponding index structures are generated.
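The split into multiple DiskRowSets can be sketched as a rolling flush; the 32 MB roll size and function names here are assumptions for illustration, not Kudu's actual flush code:

```python
# Illustrative sketch (not Kudu code) of rolling a flushed MemRowSet into
# several DiskRowSets once each reaches a size budget, which is what lets
# one large flush produce units that can later be compacted in parallel.
ROLL_THRESHOLD = 32 * 1024 * 1024  # assumed 32 MB roll size

def flush_memrowset(rows):
    """rows: iterable of (primary_key, encoded_size_bytes), sorted by key.
    Returns a list of DiskRowSets, each represented as a list of keys."""
    diskrowsets, current, current_bytes = [], [], 0
    for key, size in rows:
        if current and current_bytes + size > ROLL_THRESHOLD:
            diskrowsets.append(current)   # roll over to a new DiskRowSet
            current, current_bytes = [], 0
        current.append(key)
        current_bytes += size
    if current:
        diskrowsets.append(current)
    return diskrowsets
```

Because rows arrive sorted by primary key, each resulting DiskRowSet covers a contiguous key range, which keeps later lookups and compactions cheap.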
MemRowSets
A MemRowSet is a concurrently accessible, optimized B-tree, largely modeled on MassTree, with several differences:
- Kudu does not delete rows in place. Because of MVCC, a delete in Kudu is actually the insertion of a record marking the deletion.
- Similarly, Kudu does not perform in-place updates; an update is also recorded as a new mutation.
- Leaf nodes are linked together, as in a B+Tree. This significantly improves Scan performance.
- Kudu does not implement MassTree's trie of trees; it uses a single tree.
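The delete-as-insert behavior under MVCC can be illustrated with a toy versioned row; this is a conceptual sketch, not Kudu's MemRowSet implementation:

```python
# Minimal sketch of "delete is an insert of a deletion marker" under MVCC:
# every mutation is appended with a timestamp, and a reader at time T sees
# the latest mutation at or before T. Purely illustrative.
class MvccRow:
    def __init__(self):
        self.versions = []  # list of (timestamp, value_or_None); None = tombstone

    def upsert(self, ts, value):
        self.versions.append((ts, value))

    def delete(self, ts):
        # No in-place removal: record a deletion marker instead.
        self.versions.append((ts, None))

    def read(self, ts):
        visible = [v for t, v in self.versions if t <= ts]
        return visible[-1] if visible else None
```

A reader pinned at an older timestamp still sees the pre-delete value, which is exactly why the tree never needs destructive updates.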
DiskRowSet
When a MemRowSet is flushed to disk it becomes one or more DiskRowSets, rolling over to a new DiskRowSet every 32 MB. A DiskRowSet splits its contents into BaseData and DeltaData so that updates can be applied without rewriting the base. Data is stored column by column, split into multiple Pages, and indexed with a B-tree. Besides user data, Kudu also stores the primary key index as a column and provides a Bloom Filter for efficient lookup.
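A hedged sketch of how a read could merge BaseData with DeltaData; the data shapes here are invented, and the real CFile and delta-file formats are far more involved:

```python
# Toy illustration of a DiskRowSet read path: the base holds flushed rows,
# while deltas hold later column updates keyed by (row, timestamp). A read
# at time ts applies only the deltas visible at ts.
def read_row(base, deltas, key, ts):
    """base: dict key -> row dict; deltas: list of (key, timestamp, column, value)."""
    if key not in base:
        return None
    row = dict(base[key])
    for dkey, dts, col, val in sorted(deltas, key=lambda d: d[1]):
        if dkey == key and dts <= ts:
            row[col] = val  # apply updates visible at read time ts
    return row
```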
Compaction
To maintain query performance, Kudu periodically performs Compaction: DeltaData is merged into BaseData, rows marked for deletion are physically removed, and some DiskRowSets are merged together.
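The net effect of compaction can be sketched as folding deltas into the base and dropping tombstones; this is illustrative only, not Kudu's actual compaction algorithm:

```python
# Illustrative compaction sketch: apply DeltaData to BaseData in timestamp
# order and physically drop rows whose latest mutation is a deletion
# marker, so later reads no longer pay the merge cost.
def compact(base, deltas):
    """base: dict key -> value; deltas: list of (key, ts, value_or_None)."""
    merged = dict(base)
    for key, _, value in sorted(deltas, key=lambda d: d[1]):
        if value is None:
            merged.pop(key, None)  # tombstone: remove the row for good
        else:
            merged[key] = value
    return merged
```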
Partitioning
Choosing a partition strategy requires understanding the data model and the expected workload:
- For write-heavy workloads, it is important to design partitions so that writes are spread across Tablets, avoiding overload on any single Tablet.
- For workloads with many short scans, performance improves when all of the scanned data lives on the same Tablet.
Kudu supports two partitioning strategies:
Range Partitioning
Range partitioning divides Tablets by primary key range. By defining the partition key ranges, users can direct rows to different Tablets, either manually or automatically. For example, a table partitioned on a timestamp column can place data from different time periods into different partitions.
Hash Partitioning
Hash partitioning applies a hash function to the primary key to distribute rows evenly across Tablets. It suits workloads whose queries carry no obvious range predicate, such as point lookups by primary key or random access.
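Composing the two strategies can be sketched as follows; the bucket count, range splits, and use of Python's built-in `hash` are assumptions for illustration (a real system would use a stable hash function):

```python
from bisect import bisect_right

# Toy composite partitioning: hash buckets spread writes across servers,
# while the range component keeps rows from the same time window together.
HASH_BUCKETS = 4
RANGE_SPLITS = [100, 200]  # ranges: (-inf, 100), [100, 200), [200, +inf)

def tablet_for(key, ts):
    """Pick a (hash_bucket, range_index) pair for a row.
    Uses Python's built-in hash for brevity; a production system would
    use a stable hash so placement survives process restarts."""
    bucket = hash(key) % HASH_BUCKETS
    range_idx = bisect_right(RANGE_SPLITS, ts)
    return (bucket, range_idx)
```

With this layout, a burst of writes for one time window still fans out across all hash buckets, while a scan over one window touches only one range partition.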
Query and Write Process
Write Process
- When a client writes to a Tablet Server, the row is first written to the MemRowSet.
- When the MemRowSet reaches its capacity threshold, a background thread flushes the data to disk as DiskRowSets.
- Each write is also replicated to the Tablet's other replicas via the Raft protocol to ensure consistency.
Query Process
- When a client issues a query, the request first goes to the Master, which locates the Tablet Servers covering the requested primary key range.
- The Tablet Server reads the corresponding column data from its DiskRowSets on disk and returns it to the client.
- If the query involves only some columns, Kudu reads just those columns, exploiting the columnar layout to improve query efficiency.
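Why projection benefits from a columnar layout can be shown with a toy column store; the table contents and function names are invented for illustration:

```python
# Toy column store: each column is a separate array, so a scan that
# projects two of three columns never touches the third column's data.
columns = {
    "id":    [1, 2, 3],
    "name":  ["a", "b", "c"],
    "price": [9.5, 3.0, 7.2],
}

def scan(projection):
    """Return rows containing only the projected columns."""
    picked = {c: columns[c] for c in projection}  # untouched columns never read
    n = len(next(iter(picked.values())))
    return [{c: picked[c][i] for c in projection} for i in range(n)]
```

In a row-oriented layout, the same query would have to read and then discard the unprojected columns; here their bytes are simply never accessed.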
Tablet and Raft Consensus Protocol
Kudu data is split into multiple Tablets. Each Tablet is part of the table, similar to horizontal partitioning.
Each Tablet is divided by primary key range, ensuring even distribution across different Tablet Servers.
To ensure data reliability and consistency, Kudu uses Raft consensus protocol for replica management. Each Tablet usually has multiple replicas (default three), these replicas sync through Raft protocol.
Raft Protocol Guarantees
- Data Consistency: At any moment, only one replica can be the primary (Leader), others are Followers. All write operations must first be written to Leader, then sync to Followers via Raft protocol, ensuring data consistency.
- Fault Tolerance: If Leader replica fails, system automatically elects new Leader via Raft protocol, ensuring high availability.
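The majority rule behind both guarantees can be sketched in a few lines; this is a toy illustration of quorum arithmetic, not a Raft implementation:

```python
# Quorum arithmetic underlying Raft's guarantees: a write commits once a
# majority of replicas have persisted it, and the tablet stays writable
# as long as a majority of replicas are alive.
def majority(n_replicas):
    return n_replicas // 2 + 1

def is_committed(acks, n_replicas):
    """acks: replicas (including the Leader) that persisted the write."""
    return acks >= majority(n_replicas)

def is_writable(alive, n_replicas):
    return alive >= majority(n_replicas)
```

With the default of three replicas, one failure is tolerated: two replicas still form a majority, so writes keep committing and a new Leader can be elected.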
Error Quick Reference
| Symptom | Root Cause | Location | Fix |
|---|---|---|---|
| Occasional write timeout/retry | Partition hotspot or too few Tablets | Observe write distribution, Tablet Server write queue and RPC metrics | Increase Hash partition buckets or introduce multi-level partition, balance writes |
| Scan jitter, low cache hit rate | Large scans evict hot pages | Monitor Block Cache hit rate/eviction rate | Evaluate “segmented LRU Block Cache” in 1.18.0, or expand cache/tiered Scan |
| Tablet frequent leader election/temporarily unwritable | Raft majority unavailable or network jitter | Check replica health, heartbeat and timeout | Recover replica/network, if necessary lower replicas to available majority and rebuild |
| Leader write succeeds but reads inconsistent | Reads hit a lagging Follower, or visibility is delayed | Compare Leader/Follower logs and MVCC watermarks | Route reads to the Leader/enable consistent reads; confirm the majority commit point |
| Table capacity/Tablet imbalance | Range-only partitioning causing skew | Check partition key cardinality and time-ordered writes | Introduce a Hash dimension or time-rolling Range partitions; add partitions regularly |
| Flush frequent affecting tail latency | MemRowSet too small or write burst | Monitor Flush frequency, disk IO, queue | Adjust Flush threshold and WAL/IO concurrency; evaluate hardware bandwidth |
| Client metadata query amplification | Frequent Master queries to locate Tablet | Client cache hit and miss rate | Increase client Tablet location cache time; investigate Tablet migration frequency |