Big Data 138 - ClickHouse MergeTree Deep Dive: Partition Pruning × Sparse Primary Index × Marks × Compression

TL;DR

  • Scenario: an online MergeTree table shows severe read amplification; queries effectively scan entire columns, and TTL does not appear to take effect
  • Conclusion: combining partition pruning, the sparse primary index, and mark granularity with appropriate compression and OPTIMIZE can cut I/O for the same queries to 10-30% of the original
  • Output: reproducible DDL/queries plus a diagnosis method based on marks, compression ratios, and TTL verification

Version Matrix

| Component | Version/Type | Note |
| --- | --- | --- |
| ClickHouse Server | 24.x / 25.x | .mrk2 is the common mark-file extension; default index_granularity = 8192 |
| OS/FS | Ubuntu 22.04 / ext4 | Example machine; a different FS only affects I/O details |
| Client | clickhouse-client | Used for SQL and system-table queries |
| Tool | clickhouse-compressor | Inspect column compression statistics |

MergeTree

Data Storage

ClickHouse is a columnar database: each column’s data is stored separately, rather than each row being stored as a whole as in row-oriented databases. The advantage of columnar storage is that a query reads only the columns it references, greatly reducing I/O; this pays off especially in aggregation and filtering workloads.

A table consists of data parts sorted by the primary key. When data is inserted, it is split into parts, and the rows within each part are sorted in primary key order.

ClickHouse creates an index file (primary.idx) for each data part. The index stores the primary key value of every index_granularity-th row, i.e. of the rows numbered n * index_granularity, which is why it is called a sparse index.
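As a quick way to see the sparse index at work, the per-part granule counts can be read from system.parts; a sketch, assuming a table named mt_table (placeholder name):

```sql
-- Each mark covers roughly index_granularity rows (8192 by default)
SELECT name, rows, marks, round(rows / marks) AS rows_per_mark
FROM system.parts
WHERE table = 'mt_table' AND active;
```

With adaptive granularity (controlled by index_granularity_bytes, on by default in recent versions), rows_per_mark can be smaller than 8192 for wide rows.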

Columnar Storage

In MergeTree, data is stored by column: each column has its own .bin data file. The benefits of columnar storage:

  • Better compression
  • Minimize data scan range

MergeTree writes data into .bin files in these steps:

  1. Sort the inserted data by the ORDER BY key
  2. Compress the sorted column data
  3. Write the data to the .bin file as compressed blocks

Compressed Data Block

Compressed data block consists of two parts:

  • Header (fixed 9 bytes)
  • Compressed data

The header layout is CompressionMethod (1 byte) + CompressedSize (4 bytes) + UncompressedSize (4 bytes), i.e. CompressionMethod_CompressedSize_UncompressedSize.
You can inspect per-block compression with clickhouse-compressor, which reads the compressed file from stdin:

cd /var/lib/clickhouse/data/default/mt_table/202407_1_1_0
clickhouse-compressor --stat < data.bin > stat.log

Compressed block generation rules (with the default index_granularity = 8192, one batch corresponds to one granule):

  • If a single batch x < 64 KB: keep reading batches; a block is emitted once the accumulated size exceeds 64 KB
  • If 64 KB ≤ x ≤ 1 MB: the batch is emitted directly as one block
  • If x > 1 MB: the data is split into 1 MB blocks, and the remainder follows the rules above
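One way to observe the effect of these rules is through the per-column compressed/uncompressed sizes ClickHouse tracks; a sketch using system.columns (table name is a placeholder):

```sql
SELECT name,
       formatReadableSize(data_compressed_bytes)   AS compressed,
       formatReadableSize(data_uncompressed_bytes) AS uncompressed,
       round(data_uncompressed_bytes / data_compressed_bytes, 2) AS ratio
FROM system.columns
WHERE database = currentDatabase() AND table = 'mt_table';
```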

Data Marks

In ClickHouse, marks are part of the index machinery: they record where each granule starts inside the data files, so queries can locate the needed data blocks quickly.

A mark contains:

  • The offset of the compressed block within the .bin file
  • The offset of the granule within the decompressed block

(Per-block min/max values come from the partition minmax index and minmax skip indexes, not from the .mrk files.)

Mark granularity is configured via index_granularity. A smaller granularity means more marks and a larger mark file, but more precise reads and better query performance.

.mrk files establish the mapping between the primary index (primary.idx) and the data files (.bin):

  • Data marks and index intervals are aligned, so the mark can be found from an index entry’s number
  • Each [Column].bin has a corresponding [Column].mrk
  • The .mrk file records the offsets of the data inside the .bin file
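The per-column .bin/.mrk pairing can be confirmed from system.parts_columns; a sketch (column names as in recent releases, table name a placeholder):

```sql
SELECT name AS part, column,
       marks,                                           -- number of granules in the part
       formatReadableSize(column_marks_bytes)           AS mrk_size,
       formatReadableSize(column_data_compressed_bytes) AS bin_size
FROM system.parts_columns
WHERE table = 'mt_table' AND active
ORDER BY part, column;
```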

Partition, Index, Mark and Compression Collaboration

Partition

ClickHouse’s partitioning mechanism is a storage strategy that splits a large table’s data into independent logical segments; each partition is an independent subset of the table.

Partition key definition:

  • Single column partition: PARTITION BY toYYYYMM(date_column)
  • Multi-column partition: PARTITION BY (toYYYYMM(date_column), city)
  • Expression partition: PARTITION BY sipHash64(user_id) % 4

Main advantages:

  1. Query efficiency improvement - partition pruning
  2. Data management convenience - TTL, delete, archive, migration
  3. Time series processing
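Partition pruning can be observed directly: EXPLAIN ESTIMATE reports how many parts and granules a query would read. A sketch against a hypothetical mt_table partitioned by toYYYYMM(date_column):

```sql
EXPLAIN ESTIMATE
SELECT count()
FROM mt_table
WHERE date_column >= toDate('2024-07-01');  -- only partitions 202407 and later are read
```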

Index

ClickHouse indexing differs from traditional databases in that it relies mainly on a sparse primary key index.

  • Primary key index: determines the data sort order, assists lookups, and is a sparse index
  • Sparse index: stores an entry only for every index_granularity-th row, keeping storage overhead low
  • Skip indexes: minmax, set, bloom_filter, etc.
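Skip indexes are added with ALTER TABLE; a minimal sketch adding a bloom filter on a hypothetical city column (MATERIALIZE builds the index for parts written before the ALTER):

```sql
ALTER TABLE mt_table ADD INDEX idx_city city TYPE bloom_filter GRANULARITY 4;
ALTER TABLE mt_table MATERIALIZE INDEX idx_city;  -- backfill existing parts
```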

Marks

Marks are the physical basis of the sparse index: during queries they are used to skip granules that cannot match the predicate, accelerating execution.

Compression Collaboration

ClickHouse provides multiple compression algorithms:

  • LZ4 (default): Fast, lightweight
  • ZSTD: High compression ratio
  • Delta, DoubleDelta: Designed for time-series data
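Codecs are declared per column and can be chained, with a specialized codec (e.g. Delta) feeding a general-purpose one (e.g. ZSTD). A minimal sketch:

```sql
CREATE TABLE codec_demo (
  ts DateTime CODEC(Delta, ZSTD),  -- delta-encode timestamps, then ZSTD
  id UInt64   CODEC(LZ4),          -- fast, lightweight (the default family)
  v  Float64  CODEC(ZSTD(3))       -- higher ratio at higher CPU cost
) ENGINE = MergeTree
ORDER BY ts;
```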

Write Process

  1. Generate a new partition directory (part) for each insert
  2. Background merges combine parts belonging to the same partition
  3. Each part contains the primary.idx index file plus each column’s .bin and .mrk files

Query Process

  1. Narrow the range with partition pruning
  2. Narrow it further using the primary index and data marks
  3. Decompress only the required data blocks
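These stages can be inspected with EXPLAIN indexes = 1, which prints which partitions, parts, and granules survive each filter. A sketch using the mt_demo table from the reproducible script in this article:

```sql
EXPLAIN indexes = 1
SELECT avg(v)
FROM mt_demo
WHERE counter_id = 42 AND dt >= now() - 86400;
```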

MergeTree’s TTL

TTL (time to live) defines how long data survives; it can be set on the whole table or on individual columns.

-- TTL set on column
CREATE TABLE ttl_table_v1 (
  id String,
  create_time DateTime,
  code String TTL create_time + INTERVAL 10 SECOND,
  type UInt8 TTL create_time + INTERVAL 10 SECOND
) ENGINE = MergeTree
PARTITION BY toYYYYMM(create_time)
ORDER BY id;

-- Insert data
INSERT INTO ttl_table_v1 VALUES
('A0000', now(), 'c1', 1),
('A0000', now() + INTERVAL 10 MINUTE, 'c1', 1);

-- Manually trigger merge
OPTIMIZE TABLE ttl_table_v1 FINAL;

-- TTL set on table
CREATE TABLE ttl_table_v2 (
  id String,
  create_time DateTime,
  code String,
  type UInt8
) ENGINE = MergeTree
PARTITION BY toYYYYMM(create_time)
ORDER BY create_time
TTL create_time + INTERVAL 1 DAY;

-- Modify TTL
ALTER TABLE ttl_table_v2 MODIFY TTL create_time + INTERVAL 3 DAY;
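Note that column-level TTL does not delete rows; expired values are rewritten to the column’s default on merge. One way to verify this on ttl_table_v1 (wait more than 10 seconds after the insert, then merge):

```sql
OPTIMIZE TABLE ttl_table_v1 FINAL;
SELECT id, code, type FROM ttl_table_v1;
-- the first row's code/type should come back as '' and 0 once expired
```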

Reproducible Simple Script

-- 1) Create table: monthly partition + typical primary key
CREATE TABLE mt_demo (
  counter_id UInt32,
  dt DateTime,
  city LowCardinality(String),
  v Float64
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(dt)
ORDER BY (counter_id, dt)
SETTINGS index_granularity = 8192;

-- 2) Generate data: 10 million rows
INSERT INTO mt_demo
SELECT
  toUInt32(randUniform(1, 50000)) AS counter_id,
  now() - toUInt32(randUniform(0, 86400 * 90)) AS dt,
  concat('c', toString(toUInt32(randUniform(1, 5000)))) AS city,
  randCanonical() AS v
FROM numbers(10000000);

-- 3) Typical query: pruning + skip reading
SELECT avg(v)
FROM mt_demo
WHERE counter_id = 42 AND dt BETWEEN now()-86400 AND now();

-- 4) Observe system tables: partition/part/marks/compression
SELECT partition, name, rows, bytes_on_disk
FROM system.parts WHERE table = 'mt_demo' AND active;

SELECT partition, name, column, marks, data_compressed_bytes
FROM system.parts_columns WHERE table = 'mt_demo' AND active
ORDER BY partition, name, column;
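A per-column compression summary is often the fastest way to spot which columns dominate disk usage; a sketch aggregating system.parts_columns:

```sql
SELECT column,
       formatReadableSize(sum(column_data_compressed_bytes))   AS compressed,
       formatReadableSize(sum(column_data_uncompressed_bytes)) AS uncompressed,
       round(sum(column_data_uncompressed_bytes) / sum(column_data_compressed_bytes), 2) AS ratio
FROM system.parts_columns
WHERE table = 'mt_demo' AND active
GROUP BY column
ORDER BY column;
```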

Partition Strategy and Maintenance

-- View partition/row count/space
SELECT partition, rows, formatReadableSize(bytes_on_disk) AS disk_size
FROM system.parts WHERE table = 'mt_demo' AND active;

-- Single partition optimize
OPTIMIZE TABLE mt_demo PARTITION 202510 FINAL;

-- Detach (recoverable via ATTACH) or drop (permanent): run one or the other
ALTER TABLE mt_demo DETACH PARTITION 202409;
ALTER TABLE mt_demo DROP PARTITION 202409;

Error Quick Reference

| Symptom | Possible Root Cause | Quick Diagnosis | Fix |
| --- | --- | --- | --- |
| WHERE matches but almost a full-table scan | Primary key doesn’t include the filter column / granularity too large | EXPLAIN AST/PIPELINE; check read_rows | Adjust primary key and index_granularity, or add a skip index |
| TTL not taking effect | Data not yet expired / merge not triggered | Insert rows with a past timestamp + OPTIMIZE FINAL | Wait for background merges or trigger manually |
| Disk usage spikes | Too many small partitions/parts | Check part counts in system.parts | Coarsen partition granularity; speed up merges |
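For the read_rows check, recent queries can be pulled from system.query_log (enabled by default); a sketch:

```sql
SELECT event_time, read_rows, formatReadableSize(read_bytes) AS read, query
FROM system.query_log
WHERE type = 'QueryFinish' AND query ILIKE '%mt_demo%'
ORDER BY event_time DESC
LIMIT 5;
```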