Apache Kudu Overview

Apache Kudu is an open-source storage engine originally developed by Cloudera and contributed to the Apache Software Foundation. It addresses a key problem in big data processing: how to support both low-latency random reads and writes and efficient analytical scans in the same storage system.

Core Features

  1. Hybrid Storage Model: Supports random reads and writes (like HBase) and batch scans for analysis (like HDFS), with typical read/write latency in the millisecond range
  2. Distributed Architecture: Scales horizontally and uses the Raft consensus protocol to keep replicated data reliable and consistent
  3. Columnar Storage: Stores data by column, which improves analytical query efficiency and allows per-column compression and encoding
  4. Ecosystem Integration: Deeply integrated with Apache Spark, Impala, and Hive
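
To make the columnar storage point concrete, here is a minimal sketch of why storing data by column helps with encoding. It pivots a few rows into columns and applies run-length encoding, one of the encodings Kudu offers. The sample rows and the helper names `to_columns` and `run_length_encode` are made up for illustration, and the code does not reflect Kudu's actual on-disk format.

```python
from itertools import groupby

# Hypothetical sample rows: (host, metric, value)
rows = [
    ("web-1", "cpu", 0.41),
    ("web-1", "cpu", 0.43),
    ("web-1", "cpu", 0.40),
    ("web-2", "cpu", 0.77),
    ("web-2", "cpu", 0.79),
]


def to_columns(rows):
    """Pivot row tuples into one list per column."""
    return [list(col) for col in zip(*rows)]


def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


hosts, metrics, values = to_columns(rows)
encoded_hosts = run_length_encode(hosts)
encoded_metrics = run_length_encode(metrics)

print(encoded_hosts)    # [('web-1', 3), ('web-2', 2)]
print(encoded_metrics)  # [('cpu', 5)]

# An aggregate over one column reads only that column's values.
print(sum(values) / len(values))
```

Repetitive columns such as `host` and `metric` shrink to a handful of pairs, while an aggregate over `value` never has to read the other two columns.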

Market Positioning

  • HDFS/Parquet/ORC: Suitable for batch analysis, but poor random write performance
  • HBase/Cassandra: Suitable for random access, but poor analytical efficiency
  • Kudu: Fills the gap between the two, suitable for real-time analysis, time-series data storage, HTAP scenarios

Performance

  • Single row read latency: <10ms
  • Batch scan throughput: GB/s level
  • Supports thousands of writes per second

Architecture

Kudu uses a master/worker architecture with two node roles:

  • Master Node: Manages global metadata, such as table schemas and tablet locations
  • Tablet Server Node: Stores table data (split into tablets) and serves reads and writes for it
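
Because data is replicated across tablet servers with Raft (see Core Features), the number of replicas determines how many node failures can be tolerated. The arithmetic below is a minimal sketch of the general Raft majority rule, not Kudu code; the function names are illustrative.

```python
def majority(replicas: int) -> int:
    """Smallest number of replicas that forms a Raft majority."""
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """Replicas that can be lost while a majority is still reachable."""
    return replicas - majority(replicas)


for n in (1, 3, 5):
    print(f"replicas={n} majority={majority(n)} tolerated={tolerated_failures(n)}")
```

With 3 replicas a write needs 2 acknowledgements and one server can fail; with 5 replicas, two can fail.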

Data Model

  • Table: Similar to a relational database table; every table must define a primary key
  • Column: Supports multiple data types (integer, float, string, etc.)
  • Partition Strategy: Supports range partitioning, hash partitioning, or a combination of both
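
The partition strategies above can be sketched as a function that maps a row's key columns to a tablet. This is a simplified illustration only: the bucket count, the split points, the column names (`host`, `day`), and the use of CRC32 as the hash are all assumptions, and Kudu's real hash function and tablet layout differ.

```python
import bisect
import zlib

HASH_BUCKETS = 4
# Hypothetical split points on an integer "day" column (lower bound inclusive).
# Two split points produce three range partitions.
RANGE_SPLITS = [20250201, 20250301]


def hash_bucket(host: str) -> int:
    """Hash partitioning: spread keys evenly across a fixed number of buckets."""
    return zlib.crc32(host.encode("utf-8")) % HASH_BUCKETS


def range_partition(day: int) -> int:
    """Range partitioning: find the interval that contains the value."""
    return bisect.bisect_right(RANGE_SPLITS, day)


def tablet_for(host: str, day: int) -> tuple[int, int]:
    """Combined strategy: each (hash bucket, range partition) pair is one tablet."""
    return hash_bucket(host), range_partition(day)


total_tablets = HASH_BUCKETS * (len(RANGE_SPLITS) + 1)
print(total_tablets)                   # 12
print(range_partition(20250131))       # 0
print(range_partition(20250201))       # 1
print(tablet_for("web-1", 20250315))
```

Hashing spreads writes for the same time window across several tablets, while the range component lets a time-bounded scan skip tablets outside the window. The tablet count is the product of the two levels, which is worth keeping in mind when choosing bucket and range counts.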

Version Matrix (2025)

Component           Recommended Version        Note
Kudu                1.18.0                     2025-07 release, new segmented LRU Block Cache
Spark Integration   kudu-spark3_2.12:1.18.0    Connects with Spark 3.5
Flink Integration   Kudu Connector 2.0.0       Supports Flink 1.19/1.20
Impala Integration  Impala ≥3.3 + HMS sync     Low-latency SQL read/write

Advantages

  • Low-latency random read/write performance
  • Efficient batch queries
  • Good integration with Spark, Impala
  • Flexible data model

Disadvantages

  • Limited transaction support
  • Not suitable for cold data
  • High memory dependency

Error Quick Reference

Symptom                     Root Cause Location     Fix
Write error NOT_THE_LEADER  Accessed non-Leader     Check logs, enable retry
Clock unsynchronized        NTP/Chrony not aligned  Unify time source
Low read throughput         Too many small Tablets  Control partition count
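
For the NOT_THE_LEADER row, "enable retry" could look like the generic sketch below: retry the write with exponential backoff and jitter until a leader accepts it. `NotLeaderError`, `write_with_retry`, and `flaky_write` are hypothetical names for illustration and are not part of any Kudu client API; check what retry behavior your client library already provides before adding your own.

```python
import random
import time


class NotLeaderError(Exception):
    """Hypothetical stand-in for a NOT_THE_LEADER write failure."""


def write_with_retry(write, max_attempts=5, base_delay=0.05, sleep=time.sleep):
    """Call write() until it succeeds or max_attempts is exhausted."""
    for attempt in range(max_attempts):
        try:
            return write()
        except NotLeaderError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Exponential backoff with jitter gives a leader election time to finish.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay))


# Simulated write that fails twice before a leader accepts it.
attempts = {"count": 0}


def flaky_write():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise NotLeaderError("replica is not the leader")
    return "ok"


result = write_with_retry(flaky_write, sleep=lambda _: None)
print(result, attempts["count"])  # ok 3
```

The `sleep` parameter is injectable so the example runs instantly; in real use the default `time.sleep` applies the backoff.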