TL;DR
- Scenario: a real-time/near-real-time data warehouse that needs low-latency writes plus fast parallel columnar scans for analytics.
- Conclusion: Kudu balances throughput and consistency through its RowSet storage design, Range/Hash composite partitioning, and Raft-replicated tablets.
- Output: a comparison of architectural key points, plus a quick reference of common errors and their fix paths.
Kudu Comparison
Comparison between Kudu, HBase, and HDFS:
| Feature | HDFS | HBase | Kudu |
|---|---|---|---|
| Random R/W | Poor | Strong | Strong |
| Batch Analysis | Strong | Poor | Strong |
| Latency | High | Low | Low |
Kudu Architecture
Similar to HDFS and HBase, Kudu uses a single Master node to manage cluster metadata and any number of TabletServer nodes to store the actual data. Multiple Master nodes can be deployed for fault tolerance.
Master
Kudu’s Master node is responsible for cluster-wide metadata management and service coordination, with the following functions:
- Catalog Manager: Manages all Tablets, schemas, and other metadata in the cluster.
- Cluster Coordination: Tracks the liveness of every TabletServer node and coordinates data redistribution when a node fails.
- Tablet Directory: Tracks the location of each Tablet.
Catalog Manager
The Master node holds a single tablet, the CatalogTable, which stores every table's schema version and table state (creating, running, deleting, etc.).
Cluster Coordination
Each TabletServer in a Kudu cluster is configured with the list of Master hostnames. On startup, a TabletServer registers with the Master and reports all of its Tablets.
Tablet Directory
Because the Master caches cluster metadata, a client obtains a Tablet's location from the Master before reading or writing. The client then caches that location locally, avoiding a round trip to the Master on every request.
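A minimal sketch of this client-side location caching, assuming a simple TTL-based cache; the class and method names are illustrative, not the real Kudu client API:

```python
import time

# Hypothetical sketch of a client-side tablet-location cache with a TTL.
# TabletLocationCache and its methods are illustrative names, not the
# real Kudu client API.
class TabletLocationCache:
    def __init__(self, ttl_seconds=300.0):
        self.ttl = ttl_seconds
        self._cache = {}        # tablet_id -> (server_address, expiry_time)
        self.master_lookups = 0

    def _ask_master(self, tablet_id):
        # Stand-in for an RPC to the Master's tablet directory.
        self.master_lookups += 1
        return "tserver-%d:7050" % (hash(tablet_id) % 3)

    def lookup(self, tablet_id, now=None):
        now = time.monotonic() if now is None else now
        hit = self._cache.get(tablet_id)
        if hit is not None and hit[1] > now:
            return hit[0]       # cache hit: no Master round trip
        addr = self._ask_master(tablet_id)
        self._cache[tablet_id] = (addr, now + self.ttl)
        return addr
```

Within the TTL window, repeated lookups for the same Tablet hit the local cache and never touch the Master, which is what keeps metadata traffic off the hot path.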
Table
Kudu implements its data storage layer entirely on its own rather than reusing an existing engine. The main goals of Tablet storage are:
- Fast column scan
- Low-latency random read/write
- Consistent performance
RowSets
In Apache Kudu’s storage architecture, Tablets are further divided into smaller storage units called RowSets. RowSets come in two types, based on storage medium:
- MemRowSets:
  - Fully in-memory data structure
  - B-tree based implementation
  - Stores the most recently inserted data
  - Uses an efficient concurrency-control mechanism supporting high-throughput writes
- DiskRowSets:
  - Hybrid memory-and-disk storage structure
  - Produced from a MemRowSet by a Flush operation
  - Stored on disk in a columnar format (CFile)
  - Includes auxiliary structures such as a Bloom Filter and a primary key index
Storage Management:
- Data uniqueness: Any row that has not been logically deleted exists in exactly one RowSet.
- MemRowSet lifecycle: Each Tablet maintains only one active MemRowSet at a time. A background Flush thread persists it based on a configured time interval (default 1 second) or size threshold.
Flush Operation Key Features:
- Non-blocking design: Copy-on-Write ensures the Flush does not block client reads or writes while it runs.
- Parallel processing: A large MemRowSet is split into multiple DiskRowSets to improve parallelism.
- Data transformation: Row-oriented data is converted to columnar format, and the corresponding index structures are generated.
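The split into multiple DiskRowSets can be sketched as a rolling flush; the 32 MB roll size and function names here are assumptions for illustration, not Kudu's actual flush code:

```python
# Illustrative sketch (not Kudu code) of rolling a flushed MemRowSet into
# several DiskRowSets once each reaches a size budget, which is what lets
# one large flush produce units that can later be compacted in parallel.
ROLL_THRESHOLD = 32 * 1024 * 1024  # assumed 32 MB roll size

def flush_memrowset(rows):
    """rows: iterable of (primary_key, encoded_size_bytes), sorted by key.
    Returns a list of DiskRowSets, each represented as a list of keys."""
    diskrowsets, current, current_bytes = [], [], 0
    for key, size in rows:
        if current and current_bytes + size > ROLL_THRESHOLD:
            diskrowsets.append(current)   # roll over to a new DiskRowSet
            current, current_bytes = [], 0
        current.append(key)
        current_bytes += size
    if current:
        diskrowsets.append(current)
    return diskrowsets
```

Because rows arrive sorted by primary key, each resulting DiskRowSet covers a contiguous key range, which keeps later lookups and compactions cheap.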
MemRowSets
A MemRowSet is a concurrently accessible, optimized B-tree, largely modeled on MassTree, with several differences:
- Kudu does not delete rows in place. Because of MVCC, a delete in Kudu is actually the insertion of a record marking the deletion.
- Similarly, Kudu does not perform in-place updates; an update is also recorded as a new mutation.
- Leaf nodes are linked together, as in a B+Tree. This significantly improves Scan performance.
- Kudu does not implement MassTree's trie of trees; it uses a single tree.
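The delete-as-insert behavior under MVCC can be illustrated with a toy versioned row; this is a conceptual sketch, not Kudu's MemRowSet implementation:

```python
# Minimal sketch of "delete is an insert of a deletion marker" under MVCC:
# every mutation is appended with a timestamp, and a reader at time T sees
# the latest mutation at or before T. Purely illustrative.
class MvccRow:
    def __init__(self):
        self.versions = []  # list of (timestamp, value_or_None); None = tombstone

    def upsert(self, ts, value):
        self.versions.append((ts, value))

    def delete(self, ts):
        # No in-place removal: record a deletion marker instead.
        self.versions.append((ts, None))

    def read(self, ts):
        visible = [v for t, v in self.versions if t <= ts]
        return visible[-1] if visible else None
```

A reader pinned at an older timestamp still sees the pre-delete value, which is exactly why the tree never needs destructive updates.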
DiskRowSet
When a MemRowSet is flushed to disk it becomes one or more DiskRowSets, rolling over to a new DiskRowSet every 32 MB. A DiskRowSet splits its contents into BaseData and DeltaData so that updates can be applied without rewriting the base. Data is stored column by column, split into multiple Pages, and indexed with a B-tree. Besides user data, Kudu also stores the primary key index as a column and provides a Bloom Filter for efficient lookup.
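A hedged sketch of how a read could merge BaseData with DeltaData; the data shapes here are invented, and the real CFile and delta-file formats are far more involved:

```python
# Toy illustration of a DiskRowSet read path: the base holds flushed rows,
# while deltas hold later column updates keyed by (row, timestamp). A read
# at time ts applies only the deltas visible at ts.
def read_row(base, deltas, key, ts):
    """base: dict key -> row dict; deltas: list of (key, timestamp, column, value)."""
    if key not in base:
        return None
    row = dict(base[key])
    for dkey, dts, col, val in sorted(deltas, key=lambda d: d[1]):
        if dkey == key and dts <= ts:
            row[col] = val  # apply updates visible at read time ts
    return row
```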
Compaction
To maintain query performance, Kudu periodically performs Compaction: DeltaData is merged into BaseData, rows marked for deletion are physically removed, and some DiskRowSets are merged together.
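The net effect of compaction can be sketched as folding deltas into the base and dropping tombstones; this is illustrative only, not Kudu's actual compaction algorithm:

```python
# Illustrative compaction sketch: apply DeltaData to BaseData in timestamp
# order and physically drop rows whose latest mutation is a deletion
# marker, so later reads no longer pay the merge cost.
def compact(base, deltas):
    """base: dict key -> value; deltas: list of (key, ts, value_or_None)."""
    merged = dict(base)
    for key, _, value in sorted(deltas, key=lambda d: d[1]):
        if value is None:
            merged.pop(key, None)  # tombstone: remove the row for good
        else:
            merged[key] = value
    return merged
```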
Partitioning
Choosing a partition strategy requires understanding the data model and the expected workload:
- For write-heavy workloads, it is important to design partitions so that writes are spread across Tablets, avoiding overload on any single Tablet.
- For workloads with many short scans, performance improves when all of the scanned data lives on the same Tablet.
Kudu supports two partitioning strategies:
Range Partitioning
Range partitioning divides Tablets by primary key range. By defining the partition key ranges, users can direct rows to different Tablets, either manually or automatically. For example, a table partitioned on a timestamp column can place data from different time periods into different partitions.
Hash Partitioning
Hash partitioning applies a hash function to the primary key to distribute rows evenly across Tablets. It suits workloads whose queries carry no obvious range predicate, such as point lookups by primary key or random access.
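Composing the two strategies can be sketched as follows; the bucket count, range splits, and use of Python's built-in `hash` are assumptions for illustration (a real system would use a stable hash function):

```python
from bisect import bisect_right

# Toy composite partitioning: hash buckets spread writes across servers,
# while the range component keeps rows from the same time window together.
HASH_BUCKETS = 4
RANGE_SPLITS = [100, 200]  # ranges: (-inf, 100), [100, 200), [200, +inf)

def tablet_for(key, ts):
    """Pick a (hash_bucket, range_index) pair for a row.
    Uses Python's built-in hash for brevity; a production system would
    use a stable hash so placement survives process restarts."""
    bucket = hash(key) % HASH_BUCKETS
    range_idx = bisect_right(RANGE_SPLITS, ts)
    return (bucket, range_idx)
```

With this layout, a burst of writes for one time window still fans out across all hash buckets, while a scan over one window touches only one range partition.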
Query and Write Process
Write Process
- When a client writes to a Tablet Server, the row is first written to the MemRowSet.
- When the MemRowSet reaches its capacity threshold, a background thread flushes the data to disk as DiskRowSets.
- Each write is also replicated to the Tablet's other replicas via the Raft protocol to ensure consistency.
Query Process
- When a client issues a query, the request first goes to the Master, which locates the Tablet Servers covering the requested primary key range.
- The Tablet Server reads the corresponding column data from its DiskRowSets on disk and returns it to the client.
- If the query involves only some columns, Kudu reads just those columns, exploiting the columnar layout to improve query efficiency.
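Why projection benefits from a columnar layout can be shown with a toy column store; the table contents and function names are invented for illustration:

```python
# Toy column store: each column is a separate array, so a scan that
# projects two of three columns never touches the third column's data.
columns = {
    "id":    [1, 2, 3],
    "name":  ["a", "b", "c"],
    "price": [9.5, 3.0, 7.2],
}

def scan(projection):
    """Return rows containing only the projected columns."""
    picked = {c: columns[c] for c in projection}  # untouched columns never read
    n = len(next(iter(picked.values())))
    return [{c: picked[c][i] for c in projection} for i in range(n)]
```

In a row-oriented layout, the same query would have to read and then discard the unprojected columns; here their bytes are simply never accessed.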
Tablet and Raft Consensus Protocol
Kudu data is split into multiple Tablets. Each Tablet is part of the table, similar to horizontal partitioning.
Each Tablet is divided by primary key range, ensuring even distribution across different Tablet Servers.
To ensure data reliability and consistency, Kudu uses Raft consensus protocol for replica management. Each Tablet usually has multiple replicas (default three), these replicas sync through Raft protocol.
Raft Protocol Guarantees
- Data Consistency: At any moment, only one replica can be the primary (Leader), others are Followers. All write operations must first be written to Leader, then sync to Followers via Raft protocol, ensuring data consistency.
- Fault Tolerance: If Leader replica fails, system automatically elects new Leader via Raft protocol, ensuring high availability.
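The majority rule behind both guarantees can be sketched in a few lines; this is a toy illustration of quorum arithmetic, not a Raft implementation:

```python
# Quorum arithmetic underlying Raft's guarantees: a write commits once a
# majority of replicas have persisted it, and the tablet stays writable
# as long as a majority of replicas are alive.
def majority(n_replicas):
    return n_replicas // 2 + 1

def is_committed(acks, n_replicas):
    """acks: replicas (including the Leader) that persisted the write."""
    return acks >= majority(n_replicas)

def is_writable(alive, n_replicas):
    return alive >= majority(n_replicas)
```

With the default of three replicas, one failure is tolerated: two replicas still form a majority, so writes keep committing and a new Leader can be elected.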
Error Quick Reference
| Symptom | Root Cause | Location | Fix |
|---|---|---|---|
| Occasional write timeout/retry | Partition hotspot or too few Tablets | Observe write distribution, Tablet Server write queue and RPC metrics | Increase Hash partition buckets or introduce multi-level partition, balance writes |
| Scan jitter, low cache hit rate | Large scans evict hot pages | Monitor Block Cache hit rate/eviction rate | Evaluate “segmented LRU Block Cache” in 1.18.0, or expand cache/tiered Scan |
| Tablet frequent leader election/temporarily unwritable | Raft majority unavailable or network jitter | Check replica health, heartbeat and timeout | Recover replica/network, if necessary lower replicas to available majority and rebuild |
| Leader write succeeds but reads inconsistent | Reads hit a lagging Follower, or visibility is delayed | Compare Leader/Follower logs and MVCC watermarks | Route reads to the Leader/enable consistent reads; confirm the majority commit point |
| Table capacity/Tablet imbalance | Range-only partitioning causing skew | Check partition key cardinality and time-ordered writes | Introduce a Hash dimension or time-rolling Range partitions; add partitions regularly |
| Flush frequent affecting tail latency | MemRowSet too small or write burst | Monitor Flush frequency, disk IO, queue | Adjust Flush threshold and WAL/IO concurrency; evaluate hardware bandwidth |
| Client metadata query amplification | Frequent Master queries to locate Tablet | Client cache hit and miss rate | Increase client Tablet location cache time; investigate Tablet migration frequency |