This is article 65 in the Big Data series, a deep dive into Kafka’s log storage mechanism. Understanding the underlying storage principles helps with targeted performance tuning and troubleshooting.

Log Storage Architecture

Each Kafka partition corresponds to a directory on disk, named <topic>-<partition>. The partition directory stores multiple LogSegments, and each LogSegment consists of a group of files:

/kafka-logs/my-topic-0/
├── 00000000000000000000.log        # Data file (message content)
├── 00000000000000000000.index      # Offset index file
├── 00000000000000000000.timeindex  # Timestamp index file
├── 00000000000000000500.log
├── 00000000000000000500.index
├── 00000000000000000500.timeindex
└── ...

Each filename is a 20-digit, zero-padded decimal number representing the absolute offset of the first message in that Segment. The Segment currently being written to (the activeSegment) is always the last one.
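
The naming convention above is easy to model. The following sketch shows how a base offset maps to a 20-digit segment filename and back; `segment_basename` and `base_offset_of` are hypothetical helper names, not Kafka code:

```python
def segment_basename(base_offset: int) -> str:
    """Format a base offset the way Kafka names segment files: 20 digits, zero-padded."""
    return f"{base_offset:020d}"

def base_offset_of(filename: str) -> int:
    """Recover the base offset of a segment's first message from its filename."""
    return int(filename.split(".")[0])

print(segment_basename(500) + ".log")              # 00000000000000000500.log
print(base_offset_of("00000000000000000500.log"))  # 500
```

Because the names sort lexicographically in offset order, a plain sorted directory listing already yields the segments in offset order.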

LogSegment Segmentation Design

Benefits of splitting log into multiple Segments:

  • Easy cleanup: expired data is deleted at Segment granularity without affecting other data
  • Reduced search scope: the filename quickly narrows a message down to one Segment
  • Bounded file size: avoids the IO performance issues of overly large single files

Segmentation Triggers

Kafka rolls and creates a new Segment when any of the following conditions is met:

Parameter                 Default           Meaning
log.segment.bytes         1073741824 (1GB)  Segment file size limit
log.roll.hours            168 (7 days)      Segment max lifetime
log.index.size.max.bytes  10485760 (10MB)   Index file size limit
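
The "any condition is met" semantics amount to a simple disjunction. A minimal sketch, with the three defaults hard-coded as constants (the function and variable names are illustrative, not Kafka internals):

```python
LOG_SEGMENT_BYTES = 1 * 1024**3          # log.segment.bytes (1 GB)
LOG_ROLL_MS = 168 * 3600 * 1000          # log.roll.hours = 168 h, in ms
LOG_INDEX_SIZE_MAX_BYTES = 10 * 1024**2  # log.index.size.max.bytes (10 MB)

def should_roll(segment_bytes: int, segment_age_ms: int, index_bytes: int) -> bool:
    """Roll to a new segment if ANY threshold is reached."""
    return (segment_bytes >= LOG_SEGMENT_BYTES
            or segment_age_ms >= LOG_ROLL_MS
            or index_bytes >= LOG_INDEX_SIZE_MAX_BYTES)

# A 2 GB segment must roll even though it is young and its index is small:
print(should_roll(2 * 1024**3, 1000, 4096))  # True
```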

Index Mechanism

Kafka uses two types of indexes to accelerate message lookup. Both use a sparse index design: instead of indexing every message, one index entry is written for every fixed amount of data (controlled by log.index.interval.bytes, default 4096 bytes).

Offset Index (.index)

Each index entry stores two values:

  • Relative Offset: the message offset minus the Segment’s starting offset, stored in 4 bytes (saves space)
  • Position: the message’s byte offset in the .log file, stored in 4 bytes

Example with a Segment starting offset of 500:
Index entry: [relative offset=10, position=1024]
Meaning: the message at absolute offset 510 starts at byte position 1024 in the .log file
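
The relative-offset arithmetic behind that example is just a subtraction against the segment base; this sketch (illustrative names) makes it explicit:

```python
BASE_OFFSET = 500  # this segment's starting offset, taken from its filename

def to_relative(absolute_offset: int) -> int:
    """Absolute offset -> 4-byte relative offset stored in the index."""
    return absolute_offset - BASE_OFFSET

def to_absolute(relative_offset: int) -> int:
    """Relative offset from an index entry -> absolute offset."""
    return BASE_OFFSET + relative_offset

entry = (to_relative(510), 1024)  # (relative offset, byte position)
print(entry)            # (10, 1024)
print(to_absolute(10))  # 510
```

Storing 4-byte relative offsets instead of 8-byte absolute ones halves the per-entry space for the offset field, which is why segments cap their size: the relative offset must fit in 32 bits.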

The index file is loaded into memory via mmap (memory-mapped file), so index lookups approach memory access speed.
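
To make the mmap idea concrete, here is a sketch that writes a tiny fake .index file (assuming the 8-byte entry layout described above: 4-byte relative offset plus 4-byte position, big-endian) and decodes it through a memory mapping. The file layout is an assumption for illustration, not a byte-exact reproduction of Kafka's format:

```python
import mmap
import os
import struct
import tempfile

# Assumed entry layout: 4-byte relative offset + 4-byte position, big-endian.
ENTRY = struct.Struct(">ii")

# Build a tiny fake index file with two entries.
path = os.path.join(tempfile.mkdtemp(), "00000000000000000500.index")
with open(path, "wb") as f:
    f.write(ENTRY.pack(10, 1024))
    f.write(ENTRY.pack(25, 8192))

# Memory-map the file and decode entries directly from the mapping.
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for i in range(len(mm) // ENTRY.size):
            rel, pos = ENTRY.unpack_from(mm, i * ENTRY.size)
            print(rel, pos)  # 10 1024, then 25 8192
```

Because the fixed entry size makes the file a flat sorted array, binary search over the mapping needs no parsing pass and touches only a handful of pages.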

Timestamp Index (.timeindex)

Each timestamp index record contains:

  • Timestamp: the largest message timestamp written to that Segment so far
  • Relative Offset: the corresponding message’s relative offset

This supports time-based message lookup (e.g., a consumer resetting its offset to a certain point in time).
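
A time-based lookup over such a structure can be sketched as a binary search for the first entry at or after the target timestamp; the entries here are fabricated (timestamp_ms, relative offset) pairs:

```python
import bisect

# Illustrative .timeindex content, sorted by timestamp.
time_index = [(1000, 0), (2000, 40), (3000, 90)]  # (timestamp_ms, rel_offset)

def offset_for_time(target_ms: int):
    """Return the relative offset of the first entry at or after target_ms."""
    timestamps = [ts for ts, _ in time_index]
    i = bisect.bisect_left(timestamps, target_ms)
    return time_index[i][1] if i < len(time_index) else None

print(offset_for_time(1500))  # 40: first indexed entry at or after t=1500
```

From that relative offset, the lookup then continues through the offset index exactly as described in the next section.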

Lookup Flow Example

Looking up the message at absolute offset 368776:

  1. Locate the Segment: binary-search the Segment filenames for the largest starting offset ≤ 368776 (e.g., 00000000000000368000.log)
  2. Search the index: binary-search that Segment’s .index file for the largest entry whose relative offset does not exceed 776; assume this yields position pos=8192
  3. Sequential scan: read the .log file sequentially starting from pos=8192 until the message at offset 368776 is found

The cost of the sparse index is the small sequential scan in step 3, but because log.index.interval.bytes bounds the gap between index entries, the scan range stays tiny and the impact is negligible in practice.
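
The three steps above can be sketched end to end. All data here is fabricated for illustration (a toy segment list, sparse index, and log), not Kafka's internal structures:

```python
import bisect

segment_bases = [0, 368000, 737000]  # sorted segment starting offsets
# Sparse index for the 368000 segment: (relative offset, byte position).
sparse_index = {368000: [(0, 0), (500, 4096), (700, 8192)]}
# Toy .log contents: (absolute offset, byte position), 12 bytes per message.
log = {368000: [(368000 + r, r * 12) for r in range(1000)]}

def find_message(target: int):
    # 1. Locate the segment: largest base offset <= target.
    base = segment_bases[bisect.bisect_right(segment_bases, target) - 1]
    rel = target - base
    # 2. Binary-search the sparse index: largest entry not exceeding rel.
    entries = sparse_index[base]
    rels = [r for r, _ in entries]
    _, start_pos = entries[bisect.bisect_right(rels, rel) - 1]
    # 3. Sequential scan from start_pos until the exact offset is found.
    for offset, pos in log[base]:
        if pos >= start_pos and offset == target:
            return base, pos
    return None

print(find_message(368776))  # (368000, 9312): found in segment 368000
```

Steps 1 and 2 are both O(log n); only step 3 touches message data, and it scans at most one index interval.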

Log Write Mechanism

Data is only ever appended to the activeSegment (sequential writes). This is one foundation of Kafka’s high throughput: sequential disk IO performance can rival random memory access.

The write flow:
1. The message is appended to the end of the activeSegment's .log file
2. After every log.index.interval.bytes of data, one index entry is appended to the .index file
3. When the activeSegment hits a roll condition, a new activeSegment is created
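
The interaction between steps 1 and 2, i.e. how sparse index entries accumulate as bytes are appended, can be modeled with a toy class (illustrative structure, not Kafka's LogSegment):

```python
INDEX_INTERVAL_BYTES = 4096  # log.index.interval.bytes default

class ActiveSegment:
    def __init__(self, base_offset: int):
        self.base = base_offset
        self.log = bytearray()           # stand-in for the .log file
        self.index = []                  # (relative offset, byte position)
        self.bytes_since_index = 0
        self.next_offset = base_offset

    def append(self, payload: bytes) -> None:
        # Add an index entry only once enough bytes have accumulated.
        if self.bytes_since_index >= INDEX_INTERVAL_BYTES:
            self.index.append((self.next_offset - self.base, len(self.log)))
            self.bytes_since_index = 0
        self.log.extend(payload)         # sequential append only
        self.bytes_since_index += len(payload)
        self.next_offset += 1

seg = ActiveSegment(base_offset=500)
for _ in range(10):
    seg.append(b"x" * 1024)              # 10 KB total
print(len(seg.index))                    # 2 sparse entries, not 10
```

Ten 1 KB messages produce only two index entries, which is exactly the sparse-index trade-off: tiny index files in exchange for the short sequential scan seen in the lookup flow.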

Kafka also relies on the OS page cache and deferred flushing, controlled by log.flush.interval.messages and log.flush.interval.ms, to balance performance and durability.

Message Retention and Cleanup Strategy

Time-based Retention

# Message retention duration (priority: ms > minutes > hours)
log.retention.ms=604800000        # 7 days (milliseconds)
log.retention.minutes=10080       # 7 days (minutes)
log.retention.hours=168           # 7 days (hours, default)

Size-based Retention

# Single partition log total size limit (-1 means unlimited)
log.retention.bytes=-1

Note: log.retention.bytes is a per-partition limit, not a limit on the entire Topic.

Cleanup Granularity

Kafka cleans up at Segment granularity and never deletes individual messages. A Segment is deleted only when all of its messages have expired, i.e., even the Segment's newest message timestamp is older than the retention window.

This means actual storage may slightly exceed the configured retention limits, by at most roughly one Segment.
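
The expiry decision for a whole Segment then reduces to a single comparison against its largest timestamp. A minimal sketch with illustrative names, assuming the default 7-day retention:

```python
RETENTION_MS = 7 * 24 * 3600 * 1000  # log.retention.hours=168, in ms

def segment_expired(largest_timestamp_ms: int, now_ms: int) -> bool:
    """A segment is deletable only when even its NEWEST message has expired."""
    return now_ms - largest_timestamp_ms > RETENTION_MS

now = 10_000_000_000
print(segment_expired(now - 8 * 24 * 3600 * 1000, now))  # True: 8 days old
print(segment_expired(now - 1 * 24 * 3600 * 1000, now))  # False: 1 day old
```

Because one fresh message keeps its whole Segment alive, rolling segments more often (smaller log.segment.bytes) makes retention enforcement more precise, at the cost of more files.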

Log Compaction

Besides the delete policy, Kafka also supports log compaction:

log.cleanup.policy=compact

Compaction keeps the latest message for each Key and deletes older versions of the same Key. It suits scenarios that only need the latest state (e.g., database change logs, user configuration storage).
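
The end state of compaction can be modeled as "last write per key wins", with a null value acting as a tombstone that removes the key entirely. A toy model of the semantics, not Kafka's actual cleaner:

```python
def compact(records):
    """records: (key, value) pairs in offset order; value None = tombstone."""
    latest = {}
    for key, value in records:
        if value is None:
            latest.pop(key, None)  # tombstone: drop the key
        else:
            latest[key] = value    # later offsets overwrite earlier ones
    return latest

changelog = [("user1", "a"), ("user2", "b"), ("user1", "c"), ("user2", None)]
print(compact(changelog))  # {'user1': 'c'}
```

This is why a compacted topic can serve as a durable, replayable snapshot of latest state: replaying it from the beginning reconstructs exactly this map.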

Key Configuration Summary

Parameter                    Default         Description
log.segment.bytes            1GB             Segment roll size threshold
log.roll.hours               168h            Segment max lifetime
log.index.interval.bytes     4096B           Index sparsity (smaller = denser)
log.retention.hours          168h            Message retention duration
log.retention.bytes          -1              Partition max storage (-1 = unlimited)
log.cleanup.policy           delete          Cleanup strategy (delete/compact)
log.flush.interval.messages  Long.MAX_VALUE  Message-count trigger for flush

Kafka’s storage design makes full use of sequential writes, sparse indexes, and mmap. While ensuring high throughput, it also delivers flexible message retrieval and lifecycle management. Understanding these mechanisms is the foundation of Kafka cluster tuning and capacity planning.