Big Data 138 - ClickHouse MergeTree Deep Dive: Partition Pruning × Sparse Primary Index × Marks × Compression

TL;DR

  • Scenario: an online MergeTree table shows severe read amplification; queries effectively scan entire columns, and TTL does not appear to take effect
  • Conclusion: combining partition pruning, the sparse primary index, and mark granularity with appropriate compression and OPTIMIZE can cut I/O for the same queries to 10-30% of the original
  • Output: reproducible DDL/queries plus a diagnosis method based on marks, compression ratios, and TTL verification

Version Matrix

| Component | Version/Type | Note |
| --- | --- | --- |
| ClickHouse Server | 24.x / 25.x | .mrk2 is the common mark-file extension; default index_granularity = 8192 |
| OS/FS | Ubuntu 22.04 / ext4 | Example machine; a different FS only affects I/O details |
| Client | clickhouse-client | Used for SQL and system-table queries |
| Tool | clickhouse-compressor | Inspect column compression statistics |

MergeTree

Data Storage

ClickHouse is a columnar database: each column’s data is stored separately, rather than each row being stored as a whole as in row-oriented databases. The advantage of columnar storage is that a query reads only the columns it references, greatly reducing I/O; this pays off especially in aggregation and filtering workloads.

A table consists of data parts sorted by the primary key. When data is inserted, it is split into parts, and the rows within each part are sorted in primary key order.

ClickHouse creates an index file (primary.idx) for each data part. The index stores the primary key value of every index_granularity-th row, i.e. of the rows numbered n * index_granularity, which is why it is called a sparse index.
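As a quick way to see the sparse index at work, the per-part granule counts can be read from system.parts; a sketch, assuming a table named mt_table (placeholder name):

```sql
-- Each mark covers roughly index_granularity rows (8192 by default)
SELECT name, rows, marks, round(rows / marks) AS rows_per_mark
FROM system.parts
WHERE table = 'mt_table' AND active;
```

With adaptive granularity (controlled by index_granularity_bytes, on by default in recent versions), rows_per_mark can be smaller than 8192 for wide rows.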

Columnar Storage

In MergeTree, data is stored by column: each column has its own .bin data file. The benefits of columnar storage:

  • Better compression
  • Minimize data scan range

MergeTree writes data into .bin files in these steps:

  1. Sort the inserted data by the ORDER BY key
  2. Compress the sorted column data
  3. Write the data to the .bin file as compressed blocks

Compressed Data Block

Compressed data block consists of two parts:

  • Header (fixed 9 bytes)
  • Compressed data

The header layout is CompressionMethod (1 byte) + CompressedSize (4 bytes) + UncompressedSize (4 bytes), i.e. CompressionMethod_CompressedSize_UncompressedSize.
You can inspect per-block compression with clickhouse-compressor, which reads the compressed file from stdin:

cd /var/lib/clickhouse/data/default/mt_table/202407_1_1_0
clickhouse-compressor --stat < data.bin > stat.log

Compressed block generation rules (with the default index_granularity = 8192, one batch corresponds to one granule):

  • If a single batch x < 64 KB: keep reading batches; a block is emitted once the accumulated size exceeds 64 KB
  • If 64 KB ≤ x ≤ 1 MB: the batch is emitted directly as one block
  • If x > 1 MB: the data is split into 1 MB blocks, and the remainder follows the rules above
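One way to observe the effect of these rules is through the per-column compressed/uncompressed sizes ClickHouse tracks; a sketch using system.columns (table name is a placeholder):

```sql
SELECT name,
       formatReadableSize(data_compressed_bytes)   AS compressed,
       formatReadableSize(data_uncompressed_bytes) AS uncompressed,
       round(data_uncompressed_bytes / data_compressed_bytes, 2) AS ratio
FROM system.columns
WHERE database = currentDatabase() AND table = 'mt_table';
```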

Data Marks

In ClickHouse, marks are part of the index machinery: they record where each granule starts inside the data files, so queries can locate the needed data blocks quickly.

A mark contains:

  • The offset of the compressed block within the .bin file
  • The offset of the granule within the decompressed block

(Per-block min/max values come from the partition minmax index and minmax skip indexes, not from the .mrk files.)

Mark granularity is configured via index_granularity. A smaller granularity means more marks and a larger mark file, but more precise reads and better query performance.

.mrk files establish the mapping between the primary index (primary.idx) and the data files (.bin):

  • Data marks and index intervals are aligned, so the mark can be found from an index entry’s number
  • Each [Column].bin has a corresponding [Column].mrk
  • The .mrk file records the offsets of the data inside the .bin file
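The per-column .bin/.mrk pairing can be confirmed from system.parts_columns; a sketch (column names as in recent releases, table name a placeholder):

```sql
SELECT name AS part, column,
       marks,                                           -- number of granules in the part
       formatReadableSize(column_marks_bytes)           AS mrk_size,
       formatReadableSize(column_data_compressed_bytes) AS bin_size
FROM system.parts_columns
WHERE table = 'mt_table' AND active
ORDER BY part, column;
```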

Partition, Index, Mark and Compression Collaboration

Partition

ClickHouse’s partitioning mechanism is a storage strategy that splits a large table’s data into independent logical segments; each partition is an independent subset of the table.

Partition key definition:

  • Single column partition: PARTITION BY toYYYYMM(date_column)
  • Multi-column partition: PARTITION BY (toYYYYMM(date_column), city)
  • Expression partition: PARTITION BY sipHash64(user_id) % 4

Main advantages:

  1. Query efficiency improvement - partition pruning
  2. Data management convenience - TTL, delete, archive, migration
  3. Time series processing
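Partition pruning can be observed directly: EXPLAIN ESTIMATE reports how many parts and granules a query would read. A sketch against a hypothetical mt_table partitioned by toYYYYMM(date_column):

```sql
EXPLAIN ESTIMATE
SELECT count()
FROM mt_table
WHERE date_column >= toDate('2024-07-01');  -- only partitions 202407 and later are read
```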

Index

ClickHouse indexing differs from traditional databases in that it relies mainly on a sparse primary key index.

  • Primary key index: determines the data sort order, assists lookups, and is a sparse index
  • Sparse index: stores an entry only for every index_granularity-th row, keeping storage overhead low
  • Skip indexes: minmax, set, bloom_filter, etc.
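Skip indexes are added with ALTER TABLE; a minimal sketch adding a bloom filter on a hypothetical city column (MATERIALIZE builds the index for parts written before the ALTER):

```sql
ALTER TABLE mt_table ADD INDEX idx_city city TYPE bloom_filter GRANULARITY 4;
ALTER TABLE mt_table MATERIALIZE INDEX idx_city;  -- backfill existing parts
```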

Marks

Marks are the physical basis of the sparse index: during queries they are used to skip granules that cannot match the predicate, accelerating execution.

Compression Collaboration

ClickHouse provides multiple compression algorithms:

  • LZ4 (default): Fast, lightweight
  • ZSTD: High compression ratio
  • Delta, DoubleDelta: Designed for time-series data
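Codecs are declared per column and can be chained, with a specialized codec (e.g. Delta) feeding a general-purpose one (e.g. ZSTD). A minimal sketch:

```sql
CREATE TABLE codec_demo (
  ts DateTime CODEC(Delta, ZSTD),  -- delta-encode timestamps, then ZSTD
  id UInt64   CODEC(LZ4),          -- fast, lightweight (the default family)
  v  Float64  CODEC(ZSTD(3))       -- higher ratio at higher CPU cost
) ENGINE = MergeTree
ORDER BY ts;
```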

Write Process

  1. Generate a new partition directory (part) for each insert
  2. Background merges combine parts belonging to the same partition
  3. Each part contains the primary.idx index file plus each column’s .bin and .mrk files

Query Process

  1. Narrow the range with partition pruning
  2. Narrow it further using the primary index and data marks
  3. Decompress only the required data blocks
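These stages can be inspected with EXPLAIN indexes = 1, which prints which partitions, parts, and granules survive each filter. A sketch using the mt_demo table from the reproducible script in this article:

```sql
EXPLAIN indexes = 1
SELECT avg(v)
FROM mt_demo
WHERE counter_id = 42 AND dt >= now() - 86400;
```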

MergeTree’s TTL

TTL (time to live) defines how long data survives; it can be set on the whole table or on individual columns.

-- TTL set on column
CREATE TABLE ttl_table_v1 (
  id String,
  create_time DateTime,
  code String TTL create_time + INTERVAL 10 SECOND,
  type UInt8 TTL create_time + INTERVAL 10 SECOND
) ENGINE = MergeTree
PARTITION BY toYYYYMM(create_time)
ORDER BY id;

-- Insert data
INSERT INTO ttl_table_v1 VALUES
('A0000', now(), 'c1', 1),
('A0000', now() + INTERVAL 10 MINUTE, 'c1', 1);

-- Manually trigger merge
OPTIMIZE TABLE ttl_table_v1 FINAL;

-- TTL set on table
CREATE TABLE ttl_table_v2 (
  id String,
  create_time DateTime,
  code String,
  type UInt8
) ENGINE = MergeTree
PARTITION BY toYYYYMM(create_time)
ORDER BY create_time
TTL create_time + INTERVAL 1 DAY;

-- Modify TTL
ALTER TABLE ttl_table_v2 MODIFY TTL create_time + INTERVAL 3 DAY;
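Note that column-level TTL does not delete rows; expired values are rewritten to the column’s default on merge. One way to verify this on ttl_table_v1 (wait more than 10 seconds after the insert, then merge):

```sql
OPTIMIZE TABLE ttl_table_v1 FINAL;
SELECT id, code, type FROM ttl_table_v1;
-- the first row's code/type should come back as '' and 0 once expired
```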

Reproducible Simple Script

-- 1) Create table: monthly partition + typical primary key
CREATE TABLE mt_demo (
  counter_id UInt32,
  dt DateTime,
  city LowCardinality(String),
  v Float64
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(dt)
ORDER BY (counter_id, dt)
SETTINGS index_granularity = 8192;

-- 2) Generate data: 10 million rows
INSERT INTO mt_demo
SELECT
  toUInt32(randUniform(1, 50000)) AS counter_id,
  now() - toUInt32(randUniform(0, 86400 * 90)) AS dt,
  concat('c', toString(toUInt32(randUniform(1, 5000)))) AS city,
  randCanonical() AS v
FROM numbers(10000000);

-- 3) Typical query: pruning + skip reading
SELECT avg(v)
FROM mt_demo
WHERE counter_id = 42 AND dt BETWEEN now()-86400 AND now();

-- 4) Observe system tables: partition/part/marks/compression
SELECT partition, name, rows, bytes_on_disk
FROM system.parts WHERE table = 'mt_demo' AND active;

SELECT partition, name, column, marks, data_compressed_bytes
FROM system.parts_columns WHERE table = 'mt_demo' AND active
ORDER BY partition, name, column;
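A per-column compression summary is often the fastest way to spot which columns dominate disk usage; a sketch aggregating system.parts_columns:

```sql
SELECT column,
       formatReadableSize(sum(column_data_compressed_bytes))   AS compressed,
       formatReadableSize(sum(column_data_uncompressed_bytes)) AS uncompressed,
       round(sum(column_data_uncompressed_bytes) / sum(column_data_compressed_bytes), 2) AS ratio
FROM system.parts_columns
WHERE table = 'mt_demo' AND active
GROUP BY column
ORDER BY column;
```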

Partition Strategy and Maintenance

-- View partition/row count/space
SELECT partition, rows, formatReadableSize(bytes_on_disk) AS disk_size
FROM system.parts WHERE table = 'mt_demo' AND active;

-- Single partition optimize
OPTIMIZE TABLE mt_demo PARTITION 202510 FINAL;

-- Detach (recoverable via ATTACH) or drop (permanent): run one or the other
ALTER TABLE mt_demo DETACH PARTITION 202409;
ALTER TABLE mt_demo DROP PARTITION 202409;

Error Quick Reference

| Symptom | Possible Root Cause | Quick Diagnosis | Fix |
| --- | --- | --- | --- |
| WHERE matches but almost a full-table scan | Primary key doesn’t include the filter column / granularity too large | EXPLAIN AST/PIPELINE; check read_rows | Adjust primary key and index_granularity, or add a skip index |
| TTL not taking effect | Data not yet expired / merge not triggered | Insert rows with a past timestamp + OPTIMIZE FINAL | Wait for background merges or trigger manually |
| Disk usage spikes | Too many small partitions/parts | Check part counts in system.parts | Coarsen partition granularity; speed up merges |
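For the read_rows check, recent queries can be pulled from system.query_log (enabled by default); a sketch:

```sql
SELECT event_time, read_rows, formatReadableSize(read_bytes) AS read, query
FROM system.query_log
WHERE type = 'QueryFinish' AND query ILIKE '%mt_demo%'
ORDER BY event_time DESC
LIMIT 5;
```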