TL;DR

  • Scenario: a real-time/near-real-time data warehouse that needs low-latency writes plus parallel columnar analytical reads.
  • Conclusion: Kudu balances throughput and consistency with its RowSet design, Range/Hash composite partitioning, and Raft-replicated Tablets.
  • Output: a comparison of architectural key points, a quick reference of common errors, and fix paths.

Kudu Comparison

Comparison between Kudu, HBase, and HDFS:

  Feature          HDFS     HBase    Kudu
  Random R/W       Poor     Strong   Strong
  Batch Analysis   Strong   Poor     Strong
  Latency          High     Low      Low

Kudu Architecture

Similar to HDFS and HBase, Kudu uses a single Master node to manage cluster metadata and any number of Tablet Server nodes to store the actual data. Multiple Master nodes can be deployed for fault tolerance.

Master

Kudu’s Master node is responsible for cluster-wide metadata management and service coordination, with the following functions:

  • Catalog Manager: manages metadata such as all Tablets and Schemas in the cluster.
  • Cluster Coordination: tracks the liveness of all Server nodes and coordinates data redistribution when a Server node fails.
  • Tablet Directory: tracks the location of each Tablet.

Catalog Manager

The Master node holds a single-tablet table, the CatalogTable, which stores the schema versions and states (creating, running, deleting, etc.) of every table in the cluster.

Cluster Coordination

Each Tablet Server in a Kudu cluster must be configured with the list of Master hostnames. During cluster startup, each Tablet Server registers with the Master and reports all of its Tablet information.

Tablet Directory

Since the Master caches cluster metadata, a Client obtains Tablet locations from the Master when reading or writing data. The Client caches Tablet locations locally, avoiding a round trip to the Master on every request.
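This caching behavior can be sketched as a small TTL cache. This is an illustrative model, not Kudu's client API; all names here are made up:

```python
import time

class TabletLocationCache:
    """Client-side cache of Tablet locations, refreshed from the Master
    only on a miss or after the entry expires (hypothetical sketch)."""

    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self._cache = {}        # key -> (server, expiry time)
        self.master_lookups = 0

    def _ask_master(self, key):
        # Stand-in for an RPC to the Master's Tablet Directory.
        self.master_lookups += 1
        return "tserver-%d" % (hash(key) % 3)

    def locate(self, key):
        entry = self._cache.get(key)
        now = time.time()
        if entry and entry[1] > now:
            return entry[0]     # cache hit: no Master round trip
        server = self._ask_master(key)
        self._cache[key] = (server, now + self.ttl)
        return server

cache = TabletLocationCache()
a = cache.locate("row1")
b = cache.locate("row1")        # served from the local cache
assert a == b and cache.master_lookups == 1
```

When a cached location goes stale (for example, after a Tablet migrates), a real client would retry against the Master; the TTL here is the crude stand-in for that invalidation.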

Table

For data storage, Kudu implements its own storage engine rather than reusing an existing one. The main goals of Tablet storage are:

  • Fast column scan
  • Low-latency random read/write
  • Consistent performance

RowSets

In Apache Kudu’s storage architecture, Tablets are further divided into smaller storage units called RowSets. RowSets come in two types, depending on the storage medium:

  1. MemRowSets:

    • Fully in-memory data structure
    • B-tree based implementation
    • Stores latest inserted data
    • Uses efficient concurrency control mechanism supporting high-throughput writes
  2. DiskRowSets:

    • Hybrid memory and disk storage structure
    • Converted from MemRowSet through Flush operation
    • Stored on disk in columnar storage format (CFile)
    • Includes auxiliary structures like Bloom Filter and primary key index

Storage Management:

  • Data uniqueness: Any row data not logically deleted exists strictly in only one RowSet
  • MemRowSet lifecycle: Each Tablet maintains only one active MemRowSet. Background Flush thread triggers data persistence based on configured time interval (default 1 second) or size threshold

Flush Operation Key Features:

  • Non-blocking design: Uses Copy-on-Write to ensure Flush doesn’t affect client read/write during process
  • Parallel processing: A large MemRowSet is split into multiple DiskRowSets to improve parallelism
  • Data transformation: Original row data converted to columnar storage format, with corresponding index structures generated
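The Flush step described above can be sketched as converting row-oriented data into column-oriented chunks. This is a toy model using a row-count threshold in place of Kudu's real 32 MB size threshold; all names are illustrative:

```python
def flush_memrowset(rows, max_rows_per_diskrowset=2):
    """Convert a row-oriented MemRowSet into one or more column-oriented
    DiskRowSets (illustrative sketch; real Kudu splits by size, not row count)."""
    diskrowsets = []
    for start in range(0, len(rows), max_rows_per_diskrowset):
        chunk = rows[start:start + max_rows_per_diskrowset]
        # Pivot rows into columns: one list of values per column name.
        columns = {col: [r[col] for r in chunk] for col in chunk[0]}
        # Toy primary-key index: sorted keys of this chunk, for point lookups.
        columns["_pk_index"] = sorted(r["key"] for r in chunk)
        diskrowsets.append(columns)
    return diskrowsets

rows = [{"key": i, "value": i * 10} for i in range(5)]
drs = flush_memrowset(rows)
# 5 rows with a chunk size of 2 -> 3 DiskRowSets
assert len(drs) == 3
assert drs[0]["value"] == [0, 10]
```

Splitting a large MemRowSet into several DiskRowSets, as in the loop above, is what lets later scans and compactions work on the chunks in parallel.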

MemRowSets

A MemRowSet is a concurrency-optimized B-tree, designed largely after Masstree, but with several differences:

  • Kudu does not support direct delete operations. Because of MVCC, a delete in Kudu is actually the insertion of a record that marks the deletion (a tombstone).
  • Similarly, Kudu does not support in-place update operations; an update is also recorded as a new entry.
  • Leaf nodes are linked together, as in a B+tree. This significantly improves Scan performance.
  • Kudu does not implement Masstree’s trie of trees; it uses a single tree.
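The tombstone behavior from the first two bullets can be modeled as append-only versioned entries. This is a hypothetical in-memory sketch, not Kudu's actual B-tree code:

```python
class MemRowSet:
    """Sketch of MVCC-style mutations: deletes and updates are appended
    as new versioned entries, never applied in place (hypothetical model)."""

    def __init__(self):
        self.entries = []  # (key, timestamp, op, payload)
        self.ts = 0

    def _next_ts(self):
        self.ts += 1
        return self.ts

    def insert(self, key, value):
        self.entries.append((key, self._next_ts(), "INSERT", value))

    def delete(self, key):
        # A delete is just another versioned record: a tombstone.
        self.entries.append((key, self._next_ts(), "DELETE", None))

    def read(self, key, as_of):
        """Return the value of `key` visible at timestamp `as_of`."""
        value, present = None, False
        for k, ts, op, payload in self.entries:
            if k == key and ts <= as_of:
                present = (op == "INSERT")
                value = payload
        return value if present else None

m = MemRowSet()
m.insert("a", 1)   # ts 1
m.delete("a")      # ts 2: tombstone, the row is not physically removed
assert m.read("a", as_of=1) == 1     # an old snapshot still sees the row
assert m.read("a", as_of=2) is None  # a newer snapshot sees the delete
```

The point of the `as_of` parameter is exactly the MVCC property: readers at an older timestamp keep seeing the row even after a later delete has been recorded.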

DiskRowSet

When a MemRowSet is flushed to disk, it becomes a DiskRowSet; a new DiskRowSet is created for every 32 MB of flushed data. Kudu divides on-disk data into BaseData and DeltaData to support updates. Data is stored by column, split into multiple Pages, and indexed with a B-tree. Besides user data, Kudu also stores the primary key index in a dedicated column and provides a Bloom Filter for efficient lookups.
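The BaseData/DeltaData split means a read must overlay recorded updates on the flushed base rows. A toy sketch, with made-up structures:

```python
def read_row(base, deltas, key):
    """Apply DeltaData on top of BaseData at read time (illustrative sketch).
    `base` maps key -> row dict (the flushed, immutable BaseData);
    `deltas` is an ordered list of (key, column, new_value) updates
    recorded after the flush."""
    row = dict(base.get(key, {}))       # start from the base row, if any
    for dkey, col, val in deltas:       # replay updates in order
        if dkey == key:
            row[col] = val
    return row or None

base = {"u1": {"name": "ann", "age": 30}}
deltas = [("u1", "age", 31)]            # an update that arrived post-flush
assert read_row(base, deltas, "u1") == {"name": "ann", "age": 31}
```

Because BaseData is immutable columnar data, updates land in DeltaData and are merged at read time; this is also why the compaction described next exists.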

Compaction

To maintain query performance, Kudu periodically performs Compaction operations: it merges DeltaData into BaseData, removes data marked for deletion, and merges some DiskRowSets together.
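Compaction can be sketched as folding the deltas into a new base and dropping tombstoned rows. This is illustrative only, using the same made-up structures as a real implementation would not:

```python
def compact(base, deltas, tombstones):
    """Merge DeltaData into BaseData and drop deleted rows, producing a
    new BaseData with an empty delta list (illustrative sketch)."""
    merged = {k: dict(v) for k, v in base.items()}  # copy, don't mutate
    for key, col, val in deltas:
        if key in merged:
            merged[key][col] = val                  # fold update into base
    for key in tombstones:
        merged.pop(key, None)                       # physically remove deletes
    return merged, []                               # new base, cleared deltas

base = {"u1": {"age": 30}, "u2": {"age": 40}}
new_base, new_deltas = compact(base, [("u1", "age", 31)], {"u2"})
assert new_base == {"u1": {"age": 31}}
assert new_deltas == []
```

After compaction, reads no longer pay the per-row delta-replay cost, which is the performance motivation stated above.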


Partitioning

Choosing partition strategy requires understanding data model and expected workload:

  • For write-heavy workloads, it is important to design partitions so that writes are spread across Tablets, avoiding overload of any single Tablet.
  • For workloads involving many short scans, performance improves when all the scanned data lives on the same Tablet.

Kudu supports two partitioning strategies:

Range Partitioning

Range partitioning divides data into Tablets based on ranges of the primary key. Users can distribute data to different Tablets, manually or automatically, by setting partition key ranges. For example, a table partitioned on a timestamp column can place data from different time periods into different partitions.

Hash Partitioning

Hash partitioning hashes primary keys to distribute data evenly across Tablets. It suits workloads whose queries have no obvious range predicates, such as primary key lookups or random access.
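The two strategies can be combined into a composite Hash × Range scheme. A sketch of how a row might be routed under such a scheme; the split points, bucket count, and function names are all made up:

```python
import bisect
import zlib

def route(range_splits, hash_buckets, pk, ts):
    """Pick a tablet for a row under a composite Hash(pk) x Range(ts)
    scheme (illustrative sketch, not Kudu's partitioning code)."""
    # Hash dimension spreads concurrent writes across buckets.
    # crc32 is used here only because it is deterministic across runs.
    bucket = zlib.crc32(pk.encode()) % hash_buckets
    # Range dimension groups rows by timestamp interval.
    range_idx = bisect.bisect_right(range_splits, ts)
    return (bucket, range_idx)

splits = [100, 200]   # ranges: (-inf, 100), [100, 200), [200, +inf)
t1 = route(splits, 4, "user-1", 50)
t2 = route(splits, 4, "user-1", 150)
assert t1[0] == t2[0]              # same key -> same hash bucket
assert (t1[1], t2[1]) == (0, 1)    # different timestamps -> different ranges
```

This combination addresses both workload bullets above: the hash dimension avoids a single-Tablet write hotspot, while the range dimension keeps time-bounded scans within few Tablets.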


Query and Write Process

Write Process

  • When a Client writes data to a Tablet Server, the write is first applied to the MemRowSet.
  • When the MemRowSet reaches its capacity threshold, a background thread flushes the data to disk as DiskRowSets.
  • Writes are also replicated to the other Tablet replicas via the Raft protocol to ensure data consistency.

Query Process

  • When a Client queries data, the request is first sent to the Master, which locates the Tablet Server responsible for the requested primary key range.
  • The Tablet Server reads the relevant column data from DiskRowSets on disk and returns it to the Client.
  • If the query involves only some columns, Kudu reads only those columns, leveraging the columnar layout to improve query efficiency.
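The column-pruning point in the last bullet can be illustrated with a toy columnar scan that touches only the predicate column and the projected columns:

```python
def scan(drs, projection, pred_col, pred):
    """Columnar scan sketch: evaluate the predicate against one column,
    then materialize only the projected columns for matching rows.
    Columns not in `projection` or `pred_col` are never read."""
    matches = [i for i, v in enumerate(drs[pred_col]) if pred(v)]
    return {c: [drs[c][i] for i in matches] for c in projection}

# A toy DiskRowSet: column name -> list of values (all the same length).
drs = {"key": [1, 2, 3], "city": ["a", "b", "a"], "amount": [10, 20, 30]}
result = scan(drs, ["amount"], "city", lambda c: c == "a")
assert result == {"amount": [10, 30]}
```

Note that the `key` column is never touched in this query; in a real columnar engine that translates directly into less I/O per scan.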

Tablet and Raft Consensus Protocol

Kudu data is split into multiple Tablets. Each Tablet is a portion of a table, similar to a horizontal partition.

Each Tablet is divided by primary key range, ensuring even distribution across different Tablet Servers.

To ensure data reliability and consistency, Kudu uses Raft consensus protocol for replica management. Each Tablet usually has multiple replicas (default three), these replicas sync through Raft protocol.

Raft Protocol Guarantees

  • Data Consistency: at any moment, only one replica is the Leader; the others are Followers. All write operations go to the Leader first and are then synchronized to the Followers via the Raft protocol, ensuring data consistency.
  • Fault Tolerance: if the Leader replica fails, the system automatically elects a new Leader via the Raft protocol, ensuring high availability.
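The majority-commit rule behind these guarantees can be sketched as follows. This is a simplified model of Raft's commit condition, not Kudu's implementation:

```python
def replicate(leader_log_index, follower_acks, total_replicas=3):
    """A write is committed once the Leader plus enough Followers have
    durably logged it to form a majority (sketch of Raft's commit rule).
    `follower_acks` lists each Follower's highest acknowledged log index."""
    majority = total_replicas // 2 + 1
    # The Leader counts itself, plus every Follower caught up to this entry.
    acks = 1 + sum(1 for idx in follower_acks if idx >= leader_log_index)
    return acks >= majority

# With 3 replicas: Leader + 1 Follower ack is enough to commit.
assert replicate(leader_log_index=7, follower_acks=[7, 5]) is True
# The Leader alone cannot commit.
assert replicate(leader_log_index=7, follower_acks=[5, 6]) is False
```

This is also why the default replica count of three tolerates one failed replica: two of three still form a majority, so writes keep committing.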

Error Quick Reference

  • Symptom: Occasional write timeouts/retries
    Root cause: Partition hotspot or too few Tablets
    Where to look: Write distribution, Tablet Server write queues, and RPC metrics
    Fix: Increase Hash partition buckets or introduce multi-level partitioning to spread writes

  • Symptom: Scan jitter, low cache hit rate
    Root cause: Large scans evict hot pages
    Where to look: Block Cache hit rate and eviction rate
    Fix: Evaluate the segmented LRU Block Cache in 1.18.0, or expand the cache / tier the scans

  • Symptom: Frequent Tablet leader elections / temporarily unwritable
    Root cause: Raft majority unavailable, or network jitter
    Where to look: Replica health, heartbeats, and timeouts
    Fix: Recover the replica/network; if necessary, lower the replica count to an available majority and rebuild

  • Symptom: Write succeeds on the Leader but reads are inconsistent
    Root cause: Reads hitting a lagging Follower, or delayed visibility
    Where to look: Compare Leader/Follower logs and MVCC watermarks
    Fix: Route reads to the Leader or enable consistent reads; confirm the majority commit point

  • Symptom: Table capacity / Tablet imbalance
    Root cause: Range-only partitioning causing skew
    Where to look: Partition key cardinality and time-sequential write patterns
    Fix: Introduce a Hash dimension or time-rolling Range partitions; add partitions regularly

  • Symptom: Frequent Flushes hurting tail latency
    Root cause: MemRowSet too small, or write bursts
    Where to look: Flush frequency, disk IO, and queues
    Fix: Tune the Flush threshold and WAL/IO concurrency; evaluate hardware bandwidth

  • Symptom: Client metadata-lookup amplification
    Root cause: Frequent Master queries to locate Tablets
    Where to look: Client cache hit and miss rates
    Fix: Increase the client Tablet location cache lifetime; investigate Tablet migration frequency