Apache Kudu Overview
Apache Kudu is an open-source columnar storage engine originally developed by Cloudera and contributed to the Apache Software Foundation. It targets a key problem in big data processing: how to support low-latency random reads and writes together with efficient analytical scans in the same storage system.
Core Features
- Hybrid Storage Model: Supports random reads and writes (like HBase) as well as batch scans for analysis (like Parquet files on HDFS); typical read/write latency is in the millisecond range
- Distributed Architecture: Scales horizontally and uses the Raft consensus protocol to replicate data, which keeps it reliable and consistent
- Columnar Storage: Stores data column by column, which improves analytical query efficiency and allows per-column encoding and compression
- Ecosystem Integration: Integrates closely with Apache Spark and Apache Impala, and works with Apache Hive (for example, through Hive Metastore synchronization)
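To make the columnar storage point concrete, the following is a minimal sketch of why a column-oriented layout helps analytical queries and why it encodes well. It is plain Python and does not reflect Kudu's actual on-disk format; the sample rows and the helper names (`to_columns`, `dictionary_encode`, `dictionary_decode`) are hypothetical and exist only for illustration.

```python
# Hypothetical sample rows, as an application would see them.
rows = [
    {"host": "web-1", "metric": "cpu", "value": 0.42},
    {"host": "web-1", "metric": "mem", "value": 0.71},
    {"host": "web-2", "metric": "cpu", "value": 0.13},
    {"host": "web-2", "metric": "mem", "value": 0.65},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def dictionary_encode(values):
    """Replace each value with a small integer code plus a lookup table."""
    dictionary = []   # code -> original value
    index = {}        # original value -> code
    codes = []
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes


def dictionary_decode(dictionary, codes):
    """Invert dictionary_encode."""
    return [dictionary[c] for c in codes]


columns = to_columns(rows)

# Repetitive string columns shrink to short integer codes.
host_dict, host_codes = dictionary_encode(columns["host"])

# An aggregate such as AVG(value) only has to touch a single column.
avg_value = sum(columns["value"]) / len(columns["value"])
```

Because values of the same column sit next to each other, an aggregate reads only the columns it needs, and repetitive columns such as `host` compress to a small dictionary plus integer codes.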
Market Positioning
- HDFS with Parquet/ORC: Well suited to batch analysis, but random writes and updates perform poorly
- HBase/Cassandra: Well suited to random access, but analytical scans are inefficient
- Kudu: Fills the gap between the two; suited to real-time analytics, time-series data storage, and HTAP scenarios
Performance
- Single-row read latency: <10 ms
- Batch scan throughput: on the order of GB/s
- Write throughput: thousands of writes per second
Architecture
Kudu uses a master/worker architecture:
- Master Node: Manages global metadata, such as table schemas and tablet locations
- Tablet Server Node: Stores table data in tablets and serves reads and writes
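The division of labor between the two node types can be sketched as follows. This is a simplified in-memory model, not the Kudu client API: the classes `Master`, `TabletServer`, and `Client` are hypothetical, each table is assumed to live on a single tablet server, and replication is left out. The point is only that the master answers "where is the data?" while tablet servers hold the data itself.

```python
from dataclasses import dataclass, field


@dataclass
class TabletServer:
    """Stores rows and serves reads and writes (hypothetical model)."""
    name: str
    rows: dict = field(default_factory=dict)  # primary key -> row

    def write(self, key, row):
        self.rows[key] = row

    def read(self, key):
        return self.rows.get(key)


@dataclass
class Master:
    """Holds metadata only: which tablet server hosts which table."""
    locations: dict = field(default_factory=dict)  # table name -> server name

    def locate(self, table):
        return self.locations[table]


class Client:
    """Asks the master for a location once, then talks to the server directly."""

    def __init__(self, master, servers):
        self.master = master
        self.servers = {s.name: s for s in servers}
        self.location_cache = {}
        self.master_lookups = 0

    def _server_for(self, table):
        if table not in self.location_cache:
            self.master_lookups += 1
            self.location_cache[table] = self.master.locate(table)
        return self.servers[self.location_cache[table]]

    def write(self, table, key, row):
        self._server_for(table).write(key, row)

    def read(self, table, key):
        return self._server_for(table).read(key)


ts1, ts2 = TabletServer("ts-1"), TabletServer("ts-2")
master = Master(locations={"metrics": "ts-2"})
client = Client(master, [ts1, ts2])

client.write("metrics", 1, {"host": "web-1", "value": 0.42})
client.write("metrics", 2, {"host": "web-2", "value": 0.13})
row = client.read("metrics", 1)
```

In this sketch the master is consulted once and the location is cached, so it stays off the data path for subsequent reads and writes.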
Data Model
- Table: Similar to a relational database table; a primary key must be defined
- Column: Strongly typed; supports multiple data types (integer, float, string, etc.)
- Partition Strategy: Supports range partitioning, hash partitioning, or a combination of both
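The partition strategies above can be illustrated with a small sketch that maps a row key to a tablet using a hash dimension and a range dimension. Everything here is an assumption for illustration: the bucket count, the split points, the `(host, ts)` key, and the use of `zlib.crc32` as a stand-in hash (Kudu uses its own hash function internally).

```python
import bisect
import zlib

HASH_BUCKETS = 4                 # hypothetical number of hash buckets
RANGE_SPLITS = [100, 200, 300]   # hypothetical split points on a timestamp column
# Resulting ranges: (-inf, 100), [100, 200), [200, 300), [300, +inf)


def hash_bucket(key: str, buckets: int = HASH_BUCKETS) -> int:
    """Spread keys evenly across buckets; crc32 is only a stand-in hash."""
    return zlib.crc32(key.encode("utf-8")) % buckets


def range_index(ts: int, splits=RANGE_SPLITS) -> int:
    """Find the range partition whose bounds contain ts."""
    return bisect.bisect_right(splits, ts)


def tablet_for(host: str, ts: int):
    """Combine both dimensions: one tablet per (hash bucket, range) pair."""
    return (hash_bucket(host), range_index(ts))


def tablet_count() -> int:
    """Total tablets grow multiplicatively when the strategies are combined."""
    return HASH_BUCKETS * (len(RANGE_SPLITS) + 1)
```

Hashing spreads concurrent writes across tablets, while range partitioning lets time-bounded scans skip tablets outside the requested window. Combining them multiplies the tablet count (16 in this sketch), which is worth keeping in mind when choosing bucket and range counts.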
Version Matrix (2025)
| Component | Recommended Version | Note |
|---|---|---|
| Kudu | 1.18.0 | Released 2025-07; adds a segmented LRU block cache |
| Spark Integration | kudu-spark3_2.12:1.18.0 | Connects with Spark 3.5 |
| Flink Integration | Kudu Connector 2.0.0 | Supports Flink 1.19/1.20 |
| Impala Integration | Impala ≥3.3 + HMS sync | Low-latency SQL read/write |
Advantages
- Low-latency random read/write performance
- Efficient batch queries
- Good integration with Spark and Impala
- Flexible data model
Disadvantages
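- Limited transaction support: single-row operations are atomic, but multi-row transactions are restricted
- Not well suited to cold or archival data
- Heavy reliance on memory, so tablet servers need generous RAM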
Error Quick Reference
| Symptom | Root Cause | Fix |
|---|---|---|
| Write error NOT_THE_LEADER | Request reached a non-leader replica | Check the logs for leader changes; enable client retries |
| Clock unsynchronized | NTP/chrony not synchronized across nodes | Point all nodes at the same time source |
| Low read throughput | Too many small tablets | Limit the partition count |
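As an illustration of the "enable retry" fix for NOT_THE_LEADER, here is a minimal retry-with-exponential-backoff sketch. It does not use the Kudu client library, whose clients typically handle leader changes themselves; `NotTheLeaderError`, `write_with_retry`, and `FlakyWriter` are hypothetical names, and the delay values are assumptions.

```python
import time


class NotTheLeaderError(Exception):
    """Hypothetical stand-in for a NOT_THE_LEADER response."""


def write_with_retry(write_fn, max_attempts=5, base_delay=0.05, sleep=time.sleep):
    """Call write_fn, retrying with exponential backoff on NotTheLeaderError."""
    for attempt in range(1, max_attempts + 1):
        try:
            return write_fn()
        except NotTheLeaderError:
            if attempt == max_attempts:
                raise  # give up and surface the error to the caller
            # Back off so a new leader has time to be elected.
            sleep(base_delay * 2 ** (attempt - 1))


class FlakyWriter:
    """Test double that fails a fixed number of times before succeeding."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise NotTheLeaderError("replica is not the leader")
        return "ok"
```

A call such as `write_with_retry(FlakyWriter(2))` succeeds on the third attempt. The `sleep` parameter is injectable so the backoff schedule can be checked without actually waiting.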