Apache Kudu: Real-time Write + OLAP Architecture, Perform...

Apache Kudu Overview

Apache Kudu is an open-source storage engine developed by Cloudera and contributed to the Apache Software Foundation. It targets a key problem in big data processing: how to support both low-latency random reads and writes and efficient analytical scans in the same storage system.

Core Features

Hybrid Storage Model: Supports random reads and writes (like HBase) as well as batch scans for analysis (like files on HDFS), with typical read/write latency in the millisecond range
Distributed Architecture: Scales horizontally and uses the Raft consensus protocol to keep replicated data reliable and consistent
Columnar Storage: Stores data column by column, which speeds up analytical queries and allows per-column compression and encoding
Ecosystem Integration: Integrates closely with Apache Spark, Impala, and Hive
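
To illustrate why a columnar layout helps compression and encoding, here is a minimal sketch in plain Python. It is a toy model, not Kudu's on-disk format: the sample rows and the helper names (to_columns, run_length_encode, dictionary_encode) are assumptions made up for this example.

```python
from itertools import groupby

# Hypothetical sample rows: (host, metric, value)
rows = [
    ("web-1", "cpu", 10),
    ("web-1", "cpu", 12),
    ("web-1", "cpu", 11),
    ("web-2", "cpu", 40),
    ("web-2", "cpu", 41),
]

def to_columns(rows):
    """Pivot row tuples into one list per column."""
    return [list(col) for col in zip(*rows)]

def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

def dictionary_encode(values):
    """Replace each value with a small integer code plus a lookup table."""
    dictionary = {}
    codes = []
    for value in values:
        # setdefault assigns the next free code the first time a value is seen
        codes.append(dictionary.setdefault(value, len(dictionary)))
    return dictionary, codes

host_col, metric_col, value_col = to_columns(rows)
print(run_length_encode(host_col))    # [('web-1', 3), ('web-2', 2)]
print(dictionary_encode(metric_col))  # ({'cpu': 0}, [0, 0, 0, 0, 0])
```

Because values of the same column sit next to each other, repeated values collapse into short runs or small integer codes, and an analytical query that touches only one column does not have to read the others.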

Market Positioning

HDFS/Parquet/ORC: Suitable for batch analysis, but random writes perform poorly
HBase/Cassandra: Suitable for random access, but analytical scans are inefficient
Kudu: Fills the gap between the two; suitable for real-time analysis, time-series data storage, and HTAP scenarios

Performance

Single row read latency: <10ms
Batch scan throughput: GB/s level
Supports thousands of writes per second

Architecture

Kudu uses a master/worker architecture with two kinds of nodes:

Master Node: Responsible for global metadata management
Tablet Server Node: Responsible for table data storage and queries
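
Since replicas coordinate through Raft, a write is acknowledged only once a majority of replicas accept it. The arithmetic below is a small sketch of that majority rule; the function names are made up for illustration, and the replica counts are examples rather than settings taken from this article.

```python
def majority(replicas: int) -> int:
    """Smallest number of replicas that forms a Raft majority."""
    return replicas // 2 + 1

def tolerated_failures(replicas: int) -> int:
    """Replicas that can be lost while a majority is still available."""
    return replicas - majority(replicas)

for n in (1, 3, 5):
    print(n, majority(n), tolerated_failures(n))
# 1 1 0
# 3 2 1
# 5 3 2
```

This is why odd replica counts are the usual choice: going from 3 to 4 replicas raises the majority to 3 without tolerating any additional failure.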

Data Model

Table: Similar to a relational database table; every table must define a primary key
Column: Strongly typed, with support for multiple data types (integer, float, string, etc.)
Partition Strategy: Supports range partitioning, hash partitioning, or a combination of the two
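
The partition strategies above can be sketched as a function from a row's key to a tablet. This is an illustration only: the bucket count, the range split points, and the use of CRC32 are assumptions for the example, and Kudu's own hash function and tablet layout differ.

```python
import zlib
from bisect import bisect_right

NUM_HASH_BUCKETS = 4                            # hypothetical bucket count
RANGE_BOUNDS = [20250101, 20250201, 20250301]   # hypothetical split points (yyyymmdd)

def hash_bucket(key: str, buckets: int = NUM_HASH_BUCKETS) -> int:
    """Hash partitioning: spread keys evenly, regardless of their order."""
    # CRC32 is a stand-in hash chosen because it is deterministic across runs
    return zlib.crc32(key.encode("utf-8")) % buckets

def range_partition(day: int, bounds=RANGE_BOUNDS) -> int:
    """Range partitioning: lower bound inclusive, upper bound exclusive."""
    return bisect_right(bounds, day)

def tablet_for(device_id: str, day: int) -> tuple:
    """Combined strategy: hash on one key column, range on another."""
    return (hash_bucket(device_id), range_partition(day))

total_tablets = NUM_HASH_BUCKETS * (len(RANGE_BOUNDS) + 1)
print(total_tablets)                    # 16
print(range_partition(20250115))        # 1
print(tablet_for("device-42", 20250115)[1])  # 1
```

Hashing spreads concurrent writes across tablets, while the range component lets a time-bounded scan skip tablets outside the requested interval. Note that the tablet count is the product of the two levels, so it grows quickly.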

Version Matrix (2025)

Component	Recommended Version	Note
Kudu	1.18.0	2025-07 release, new segmented LRU Block Cache
Spark Integration	kudu-spark3_2.12:1.18.0	Connects with Spark 3.5
Flink Integration	Kudu Connector 2.0.0	Supports Flink 1.19/1.20
Impala Integration	Impala ≥3.3 + HMS sync	Low-latency SQL read/write
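
As a rough sketch of what the Spark integration looks like in use, the snippet below reads a Kudu table into a DataFrame from PySpark. It assumes an existing SparkSession started with the kudu-spark3 artifact from the table above on its classpath; the master address and table name in the comment are placeholders, and the helper names are invented for this example.

```python
def kudu_read_options(master_addresses: str, table: str) -> dict:
    """Build the two options the Kudu Spark data source needs for a read."""
    return {"kudu.master": master_addresses, "kudu.table": table}

def read_kudu_table(spark, master_addresses: str, table: str):
    """Load a Kudu table as a DataFrame.

    `spark` is an existing SparkSession (assumed to have the kudu-spark3
    jar available); this function does not create one.
    """
    options = kudu_read_options(master_addresses, table)
    return spark.read.format("kudu").options(**options).load()

# Hypothetical usage, with placeholder host and table names:
# df = read_kudu_table(spark, "kudu-master-1:7051", "metrics")
# df.filter(df.host == "web-1").show()
```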

Advantages

Low-latency random read/write performance
Efficient batch queries
Good integration with Spark, Impala
Flexible data model

Disadvantages

Limited transaction support
Not suitable for cold data
High memory dependency

Error Quick Reference

Symptom	Root Cause	Fix
Write error NOT_THE_LEADER	Request reached a non-leader replica	Check logs, enable client retries
Clock unsynchronized	NTP/chrony not synchronized across nodes	Use one consistent time source
Low read throughput	Too many small tablets	Limit the partition count
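
The "enable retry" fix for NOT_THE_LEADER can be sketched as a retry loop with exponential backoff. This is a generic pattern, not Kudu client code: NotTheLeaderError and write_with_retry are hypothetical names, and the attempt count and delays are arbitrary defaults.

```python
import random
import time

class NotTheLeaderError(Exception):
    """Hypothetical stand-in for a NOT_THE_LEADER write failure."""

def write_with_retry(write_fn, max_attempts=5, base_delay=0.05, sleep=time.sleep):
    """Call write_fn, retrying with exponential backoff while leadership settles."""
    for attempt in range(1, max_attempts + 1):
        try:
            return write_fn()
        except NotTheLeaderError:
            if attempt == max_attempts:
                raise  # give up and surface the error to the caller
            delay = base_delay * 2 ** (attempt - 1)
            # Jitter keeps many clients from retrying at the same instant
            sleep(delay + random.uniform(0, delay))

# Usage with a hypothetical write that fails twice before succeeding
attempts = {"count": 0}

def flaky_write():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise NotTheLeaderError("leader election in progress")
    return "ok"

print(write_with_retry(flaky_write, base_delay=0.001))  # ok
```

Backing off gives a leader election time to finish; if the error persists past the last attempt, the cause is more likely a cluster problem that the logs should explain.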