This is article 65 in the Big Data series, a deep dive into Kafka's log storage mechanism. Understanding the underlying storage principles helps with targeted performance tuning and troubleshooting.
Log Storage Architecture
Each Kafka partition corresponds to a directory on disk, named <topic>-<partition>. The partition directory stores multiple LogSegments, and each LogSegment consists of a group of files:
/kafka-logs/my-topic-0/
├── 00000000000000000000.log # Data file (message content)
├── 00000000000000000000.index # Offset index file
├── 00000000000000000000.timeindex # Timestamp index file
├── 00000000000000000500.log
├── 00000000000000000500.index
├── 00000000000000000500.timeindex
└── ...
Each filename is a 20-digit zero-padded decimal number representing the absolute offset of the first message in that Segment. The Segment currently being written (the activeSegment) is always the last one.
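As a quick illustration of the naming scheme, the base filename can be derived from a segment's starting offset with simple zero-padding (`segment_basename` is a hypothetical helper, not a Kafka API):

```python
def segment_basename(base_offset: int) -> str:
    """Zero-pad the segment's starting offset to 20 decimal digits,
    matching Kafka's on-disk segment file naming."""
    return f"{base_offset:020d}"

# The segment starting at offset 500 yields files like:
print(segment_basename(500) + ".log")    # -> 00000000000000000500.log
print(segment_basename(500) + ".index")  # -> 00000000000000000500.index
```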
LogSegment Segmentation Design
Benefits of splitting the log into multiple Segments:
- Easy cleanup: expired data is deleted at Segment granularity, without affecting other data
- Smaller search scope: the filename quickly narrows a message down to its Segment
- Bounded file size: avoids the I/O performance problems of a single oversized file
Segmentation Triggers
Kafka rolls over and creates a new Segment when any of the following conditions is met:
| Parameter | Default | Meaning |
|---|---|---|
| log.segment.bytes | 1073741824 (1 GB) | Segment file size limit |
| log.roll.hours | 168 (7 days) | Segment maximum lifetime |
| log.index.size.max.bytes | 10485760 (10 MB) | Index file size limit |
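The "any condition met" roll logic can be sketched as a simple predicate. This is an illustrative simplification (the function name and parameters are hypothetical, not Kafka internals), using the default values from the table above:

```python
LOG_SEGMENT_BYTES = 1 << 30              # log.segment.bytes default (1 GiB)
LOG_ROLL_MS = 168 * 3600 * 1000          # log.roll.hours default (7 days)
LOG_INDEX_SIZE_MAX_BYTES = 10 * 1 << 20  # log.index.size.max.bytes default (10 MiB)

def should_roll(segment_bytes: int, index_bytes: int,
                segment_created_ms: int, now_ms: int) -> bool:
    """Roll to a new segment when ANY limit is hit:
    data file size, segment age, or index file full."""
    return (segment_bytes >= LOG_SEGMENT_BYTES
            or now_ms - segment_created_ms >= LOG_ROLL_MS
            or index_bytes >= LOG_INDEX_SIZE_MAX_BYTES)
```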
Index Mechanism
Kafka uses two types of index to accelerate message lookup. Both use a sparse design: instead of indexing every message, one index entry is written per fixed amount of log data (controlled by log.index.interval.bytes, default 4096 bytes).
Offset Index (.index)
Each index entry stores two values:
- Relative offset: the message offset minus the Segment's starting offset, stored in 4 bytes (saves space)
- Position: the message's byte offset within the .log file, stored in 4 bytes
Example with Segment starting offset 500:
Index entry: [relative offset=10, position=1024]
This means the message at absolute offset 510 starts at byte position 1024 in the .log file.
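The 4-byte + 4-byte layout described above can be modeled with `struct`. This is a sketch of the entry encoding under the assumptions in this section (two 4-byte integers per entry); the helper names and the fixed byte order are illustrative, not Kafka source code:

```python
import struct

BASE_OFFSET = 500  # this example segment's starting offset

def pack_index_entry(abs_offset: int, position: int) -> bytes:
    """Encode one sparse-index entry: 4-byte relative offset + 4-byte position."""
    return struct.pack(">ii", abs_offset - BASE_OFFSET, position)

def unpack_index_entry(raw: bytes) -> tuple:
    """Decode an entry back to (absolute offset, position in .log file)."""
    rel, pos = struct.unpack(">ii", raw)
    return BASE_OFFSET + rel, pos

entry = pack_index_entry(510, 1024)
print(len(entry))                 # each entry is 8 bytes
print(unpack_index_entry(entry))  # (510, 1024)
```

Storing a 4-byte relative offset instead of an 8-byte absolute offset halves the offset field, which is why the entry fits in 8 bytes.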
The index file is loaded into memory via mmap (memory mapping), so index lookups approach memory-access speed.
Timestamp Index (.timeindex)
Each timestamp index entry contains:
- Timestamp: the largest message timestamp in the Segment at the time the entry was written
- Relative offset: the corresponding message's relative offset
Used to support time-range message lookup (e.g., consumer resetting offset to certain time point).
Lookup Flow Example
Looking up the message at absolute offset 368776:
1. Locate the Segment: binary-search the Segment filenames for the largest starting offset ≤ 368776 (e.g., 00000000000000368000.log)
2. Search the index: binary-search that Segment's .index file for the largest entry whose relative offset does not exceed 776; assume this yields position pos=8192
3. Sequential scan: read the .log file sequentially starting at pos=8192 until the message at offset 368776 is found
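Steps 1 and 2 above are both "largest entry not exceeding the target" binary searches, which `bisect` expresses directly. A minimal sketch (function names are hypothetical; real Kafka reads these structures from the mmap'd files):

```python
import bisect

def locate_segment(base_offsets, target_offset):
    """Step 1: find the largest segment base offset <= target_offset."""
    i = bisect.bisect_right(base_offsets, target_offset) - 1
    return base_offsets[i]

def locate_position(index_entries, rel_offset):
    """Step 2: find the file position of the largest index entry whose
    relative offset <= rel_offset. Entries are (relative offset, position)."""
    rels = [r for r, _ in index_entries]
    i = bisect.bisect_right(rels, rel_offset) - 1
    return index_entries[i][1]

base = locate_segment([0, 368000, 737000], 368776)          # -> 368000
pos = locate_position([(0, 0), (700, 8192)], 368776 - base) # -> 8192
# Step 3 would then scan the .log file forward from pos.
```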
The cost of the sparse index is the short sequential scan in step 3, but because log.index.interval.bytes bounds the gap between index entries, the scan range stays small and the actual impact is negligible.
Log Write Mechanism
Data is only ever appended to the activeSegment (sequential writes). This is one foundation of Kafka's high throughput: sequential disk I/O is dramatically faster than random disk I/O and, in some cases, can even approach random memory access.
Write flow:
1. The message is appended to the end of the activeSegment's .log file
2. After every log.index.interval.bytes of new data, one index entry is appended to the .index file
3. When the activeSegment hits a roll condition, a new activeSegment is created
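The interplay of steps 1 and 2 can be sketched with a toy writer that appends every message but only emits an index entry after enough bytes have accumulated (the `SegmentWriter` class is illustrative, not a Kafka class):

```python
LOG_INDEX_INTERVAL_BYTES = 4096  # log.index.interval.bytes default

class SegmentWriter:
    """Toy model of the append path: every message goes to the .log data,
    but an index entry is added only per LOG_INDEX_INTERVAL_BYTES of data."""

    def __init__(self, base_offset: int):
        self.base_offset = base_offset
        self.log = bytearray()        # stands in for the .log file
        self.index = []               # (relative offset, byte position) entries
        self.bytes_since_index = 0

    def append(self, offset: int, payload: bytes) -> None:
        if self.bytes_since_index >= LOG_INDEX_INTERVAL_BYTES:
            self.index.append((offset - self.base_offset, len(self.log)))
            self.bytes_since_index = 0
        self.log += payload
        self.bytes_since_index += len(payload)
```

With 1000-byte messages, the first index entry appears only at the sixth append, once more than 4096 bytes have been written since the last entry, which is exactly the sparseness described above.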
Kafka also relies on the OS page cache and deferred flushing, controlled by log.flush.interval.messages and log.flush.interval.ms, to balance performance and durability.
Message Retention and Cleanup Strategy
Time-based Retention
# Message retention duration (priority: ms > minutes > hours)
log.retention.ms=604800000 # 7 days (milliseconds)
log.retention.minutes=10080 # 7 days (minutes)
log.retention.hours=168 # 7 days (hours, default)
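The precedence noted in the comment above (ms over minutes over hours) can be expressed as a small resolver. This is a sketch of the documented precedence rule, not Kafka code, and `effective_retention_ms` is a hypothetical name:

```python
def effective_retention_ms(ms=None, minutes=None, hours=168):
    """Resolve the retention window: log.retention.ms wins over
    log.retention.minutes, which wins over log.retention.hours (default 168)."""
    if ms is not None:
        return ms
    if minutes is not None:
        return minutes * 60_000
    return hours * 3_600_000

print(effective_retention_ms())  # default 7 days -> 604800000 ms
```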
Size-based Retention
# Single partition log total size limit (-1 means unlimited)
log.retention.bytes=-1
Note: log.retention.bytes is a per-partition limit, not a limit on the entire Topic.
Cleanup Granularity
Kafka cleans up at Segment granularity and never deletes individual messages. A Segment is deleted only when every message in it has expired, i.e., its newest message's timestamp is older than the retention window.
This means actual storage may slightly exceed the configured retention limit, by at most roughly one Segment.
Log Compaction
Besides deletion strategy, Kafka also supports log compaction:
log.cleanup.policy=compact
The compact policy keeps the latest message for each Key and removes older versions of the same Key. It suits scenarios that only need the latest state (e.g., database change logs, user configuration storage).
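The keep-latest-per-key semantics can be illustrated with a one-pass fold over the log. This is a deliberately simplified sketch: real compaction works on segment files, preserves offsets, and handles tombstones, none of which is modeled here:

```python
def compact(records):
    """Simplified compaction semantics: scan (key, value) records in log
    order and keep only the last value seen for each key."""
    latest = {}
    for key, value in records:
        latest[key] = value
    return latest

log = [("user1", "addr=A"), ("user2", "addr=B"), ("user1", "addr=C")]
print(compact(log))  # {'user1': 'addr=C', 'user2': 'addr=B'}
```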
Key Configuration Summary
| Parameter | Default | Description |
|---|---|---|
| log.segment.bytes | 1 GB | Segment roll size threshold |
| log.roll.hours | 168 h | Segment maximum lifetime |
| log.index.interval.bytes | 4096 B | Index sparsity (smaller = denser) |
| log.retention.hours | 168 h | Message retention duration |
| log.retention.bytes | -1 | Partition max storage (-1 = unlimited) |
| log.cleanup.policy | delete | Cleanup strategy (delete/compact) |
| log.flush.interval.messages | Long.MAX_VALUE | Message-count trigger for flush |
Kafka's storage design makes full use of sequential writes, sparse indexes, and mmap. It delivers high throughput while still providing flexible message retrieval and lifecycle management. Understanding these mechanisms is the foundation for Kafka cluster tuning and capacity planning.