Big Data 138 - ClickHouse MergeTree Deep Dive: Partition Pruning × Sparse Primary Index × Marks × Compression
TL;DR
- Scenario: an online MergeTree table shows severe read amplification; queries effectively scan entire columns, and TTL does not take effect
- Conclusion: combining partition pruning, the sparse primary index, and mark granularity, together with compression and OPTIMIZE, can cut I/O for the same queries to 10-30% of the original
- Output: reproducible DDL/queries plus a diagnosis method based on marks, compression ratio, and TTL verification
Version Matrix
| Component | Version/Type | Note |
|---|---|---|
| ClickHouse Server | 24.x/25.x | .mrk2 is the common mark-file extension; default index_granularity = 8192 |
| OS/FS | Ubuntu 22.04 / ext4 | Example machine; different FS only affects I/O details |
| Client | clickhouse-client | Used for SQL and system table queries |
| Tool | clickhouse-compressor | View column compression statistics |
MergeTree
Data Storage
ClickHouse is a column-oriented database: each column's data is stored separately, rather than each row being stored as a whole as in row-oriented databases. The advantage of columnar storage is that a query reads only the columns it needs, greatly reducing I/O, which pays off especially in aggregation and filtering.
A table consists of data parts sorted by primary key. When data is inserted, it is split into data parts, each sorted by the primary key in lexicographic order.
ClickHouse creates an index file for each data part. The index file stores the primary key value of each index row, where the index rows are the rows numbered n * index_granularity.
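A quick way to see this sparse index in action is to count marks per part; each mark corresponds to one granule of index_granularity rows (the table name `mt_table` here is illustrative):

```sql
-- marks ≈ ceil(rows / index_granularity) for each active part
SELECT name, rows, marks, round(rows / marks) AS avg_rows_per_granule
FROM system.parts
WHERE table = 'mt_table' AND active;
```

With the default granularity, `avg_rows_per_granule` should hover around 8192 (adaptive granularity can make it smaller for wide rows).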
Columnar Storage
In MergeTree, data is stored column by column: each column has its own .bin data file. Benefits of columnar storage:
- Better compression
- Minimize data scan range
MergeTree writes data into .bin files in these steps:
- Sort the rows by the ORDER BY key
- Compress the column data
- Write the data to the .bin file as compressed blocks
Compressed Data Block
Compressed data block consists of two parts:
- Header (fixed 9 bytes): CompressionMethod (1 byte) + CompressedSize (4 bytes) + UncompressedSize (4 bytes)
- Compressed data
You can inspect the compressed blocks of a column file like this:
cd /var/lib/clickhouse/data/default/mt_table/202407_1_1_0
clickhouse-compressor --stat < data.bin > out.log
Compressed block generation rules (with the default index_granularity = 8192, each batch below is one granule; the thresholds are the defaults of min_compress_block_size = 64 KB and max_compress_block_size = 1 MB):
- If the accumulated uncompressed data x < 64 KB, keep reading the next batch; a block is generated once the size exceeds 64 KB
- If a single batch satisfies 64 KB <= x <= 1 MB, it is written directly as one block
- If x > 1 MB, the data is split into 1 MB blocks; the remainder follows the rules above
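The two thresholds above can be confirmed on a live server:

```sql
-- Defaults: min_compress_block_size = 65536, max_compress_block_size = 1048576
SELECT name, value
FROM system.settings
WHERE name IN ('min_compress_block_size', 'max_compress_block_size');
```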
Data Marks
In ClickHouse, a mark records where each granule's data starts in the data file, so the compressed blocks a query needs can be located quickly.
A mark contains:
- The offset of the compressed block within the .bin file
- The offset of the granule within the decompressed block
(Per-column min/max values are kept by the separate minmax skip index, not by marks.)
Mark granularity is configured via index_granularity. A smaller granularity means more marks and larger mark files, but more precise reads and usually better point-query performance.
.mrk files establish the mapping between the primary index (primary.idx) and the .bin data files.
- Data marks and index intervals are aligned, so the mark can be found from an index interval's number
- Each [Column].bin has a corresponding [Column].mrk
- The .mrk file records, per granule, the offsets of the data inside the .bin file
Partition, Index, Mark and Compression Collaboration
Partition
ClickHouse’s partitioning mechanism is a storage strategy splitting large table data into independent logical segments. Each partition is equivalent to an independent subset of table.
Partition key definition:
- Single-column partition: PARTITION BY toYYYYMM(date_column)
- Multi-column partition: PARTITION BY (toYYYYMM(date_column), city)
- Expression partition: PARTITION BY sipHash64(user_id) % 4
Main advantages:
- Query efficiency improvement - partition pruning
- Data management convenience - TTL, delete, archive, migration
- Time series processing
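A minimal sketch of partition pruning, using a hypothetical `events` table partitioned by month:

```sql
CREATE TABLE events (
    dt DateTime,
    user_id UInt64
) ENGINE = MergeTree
PARTITION BY toYYYYMM(dt)
ORDER BY (user_id, dt);

-- The date range maps to exactly one partition (202407),
-- so all other partitions are skipped without touching disk.
SELECT count()
FROM events
WHERE dt >= '2024-07-01' AND dt < '2024-08-01';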
Index
ClickHouse's indexing differs from traditional databases: it relies mainly on a sparse primary key index.
- Primary key index: determines the data sort order and assists queries; it is a sparse index
- Sparse index: only marks every index_granularity-th row, reducing storage overhead
- Skip indexes: minmax, set, bloom_filter, etc.
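Skip indexes are added per column; a hedged sketch on a hypothetical `mt_table` with a `city` column:

```sql
-- One index entry covers GRANULARITY granules (4 × 8192 rows by default)
ALTER TABLE mt_table ADD INDEX idx_city city TYPE bloom_filter GRANULARITY 4;
-- Build the index for existing parts (new inserts get it automatically)
ALTER TABLE mt_table MATERIALIZE INDEX idx_city;
```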
Marks
Marks are basis for sparse index implementation. During queries, marks are used to skip unneeded blocks, accelerating query process.
Compression Collaboration
ClickHouse provides multiple compression algorithms:
- LZ4 (default): fast, lightweight
- ZSTD: higher compression ratio
- Delta, DoubleDelta: transform codecs designed for time-series data, usually chained with a general-purpose codec
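Codecs are declared per column; a sketch with a hypothetical time-series table `ts_demo`:

```sql
CREATE TABLE ts_demo (
    -- DoubleDelta encodes timestamp deltas, then ZSTD compresses the result
    dt DateTime CODEC(DoubleDelta, ZSTD),
    sensor_id UInt32 CODEC(ZSTD(3)),
    -- Gorilla works well for slowly changing floats
    value Float64 CODEC(Gorilla, ZSTD)
) ENGINE = MergeTree
ORDER BY (sensor_id, dt);
```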
Write Process
- Generate partition directory
- Merge partition directories
- Generate primary.idx index file, each column’s bin and mrk files
Query Process
- Narrow query range based on partition
- Narrow query range based on data marks
- Decompress data blocks
MergeTree’s TTL
TTL (time to live) controls how long data survives; it can be set at the table level or per column.
-- TTL set on column
CREATE TABLE ttl_table_v1 (
id String,
create_time DateTime,
code String TTL create_time + INTERVAL 10 SECOND,
type UInt8 TTL create_time + INTERVAL 10 SECOND
) ENGINE = MergeTree
PARTITION BY toYYYYMM(create_time)
ORDER BY id;
-- Insert data
INSERT INTO ttl_table_v1 VALUES
('A0000', now(), 'c1', 1),
('A0000', now() + INTERVAL 10 MINUTE, 'c1', 1);
-- Wait at least 10 seconds, then manually trigger a merge; expired values of code/type are reset to their column defaults
OPTIMIZE TABLE ttl_table_v1 FINAL;
-- TTL set on table
CREATE TABLE ttl_table_v2 (
id String,
create_time DateTime,
code String,
type UInt8
) ENGINE = MergeTree
PARTITION BY toYYYYMM(create_time)
ORDER BY create_time
TTL create_time + INTERVAL 1 DAY;
-- Modify the table-level TTL
ALTER TABLE ttl_table_v2 MODIFY TTL create_time + INTERVAL 3 DAY;
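After MODIFY TTL, existing parts only apply the new rule when they are rewritten by a merge; this forces a rewrite (assuming the table-level TTL table above is named ttl_table_v2):

```sql
-- Rewrite existing parts so the new TTL applies to already-written data
ALTER TABLE ttl_table_v2 MATERIALIZE TTL;
```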
Reproducible Simple Script
-- 1) Create table: monthly partition + typical primary key
CREATE TABLE mt_demo (
counter_id UInt32,
dt DateTime,
city LowCardinality(String),
v Float64
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(dt)
ORDER BY (counter_id, dt)
SETTINGS index_granularity = 8192;
-- 2) Generate data: 10 million rows
-- (randUniform returns Float64, so cast explicitly to the target types)
INSERT INTO mt_demo
SELECT
    toUInt32(randUniform(1, 50000)) AS counter_id,
    now() - toUInt32(randUniform(0, 86400 * 90)) AS dt,
    concat('c', toString(toUInt32(randUniform(1, 5000)))) AS city,
    randCanonical() AS v
FROM numbers(10000000);
-- 3) Typical query: pruning + skip reading
SELECT avg(v)
FROM mt_demo
WHERE counter_id = 42 AND dt BETWEEN now()-86400 AND now();
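To verify that pruning actually happens, the same query can be prefixed with EXPLAIN; `indexes = 1` reports which partitions and granule ranges survive the partition key and primary key filters:

```sql
EXPLAIN indexes = 1
SELECT avg(v)
FROM mt_demo
WHERE counter_id = 42 AND dt BETWEEN now() - 86400 AND now();
```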
-- 4) Observe system tables: partition/part/marks/compression
SELECT partition, name, rows, bytes_on_disk
FROM system.parts WHERE table = 'mt_demo' AND active;
SELECT partition, name, column, marks, data_compressed_bytes
FROM system.parts_columns WHERE table = 'mt_demo' AND active
ORDER BY partition, name, column;
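The same system table also yields a per-column compression ratio, which is useful when choosing codecs:

```sql
SELECT
    column,
    formatReadableSize(sum(data_compressed_bytes))   AS compressed,
    formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed,
    round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 2) AS ratio
FROM system.parts_columns
WHERE table = 'mt_demo' AND active
GROUP BY column
ORDER BY ratio DESC;
```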
Partition Strategy and Maintenance
-- View partition/row count/space
SELECT partition, rows, formatReadableSize(bytes_on_disk) AS disk_size FROM system.parts WHERE table = 'mt_demo' AND active;
-- Single partition optimize
OPTIMIZE TABLE mt_demo PARTITION 202510 FINAL;
-- Detach (keeps data on disk under detached/) and drop (deletes data); alternatives, not a sequence
ALTER TABLE mt_demo DETACH PARTITION 202409;
ALTER TABLE mt_demo DROP PARTITION 202409;
Error Quick Reference
| Symptom | Possible Root Cause | Quick Locating Method | Handling Solution |
|---|---|---|---|
| WHERE hits index but still near full table scan | Primary key doesn't include the filter column / granularity too large | EXPLAIN AST/PIPELINE, check read_rows | Adjust the primary key and index_granularity, or add a skip index |
| TTL not taking effect | Data not yet expired / merge not triggered | Insert data with a timestamp in the past + OPTIMIZE FINAL | Wait for a background merge or trigger one manually |
| Disk spikes | Too many small partitions/parts | system.parts view part count | Adjust partition granularity, accelerate merge |
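The "too many parts" symptom in the last row can be checked directly:

```sql
-- Active part count per partition; persistently high counts suggest
-- partition granularity is too fine or merges are falling behind
SELECT table, partition, count() AS parts
FROM system.parts
WHERE active
GROUP BY table, partition
ORDER BY parts DESC
LIMIT 20;
```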