Big Data 137 - ClickHouse MergeTree Practical Guide
TL;DR
- Scenario: time-series and log tables with heavy writes and slow queries, where you want MergeTree's write throughput and range-query advantages
- Conclusion: following the SOP in this article, you can stand up a minimal PARTITION BY + ORDER BY + index_granularity loop in about 10 minutes
- Output: a detailed walkthrough of the ClickHouse MergeTree storage structure and its principles
MergeTree
MergeTree and the other engines in its family are the most powerful table engines in ClickHouse. The MergeTree series is ClickHouse's core storage engine, designed for high-throughput writes and efficient queries, and is especially well suited to time-series data, log analysis, and other big-data scenarios.
MergeTree Engine Series Basic Concept
When you have massive amounts of data to insert into a table, writing it in batches as data parts is far more efficient than traditional row-by-row inserts. How it works:

Data write phase:
- Data is written in batches as data parts
- Each data part is sorted internally by primary key
- Writes are append-only; existing data is never modified

Background merge process:
- The system triggers merges automatically in the background (typically within 10-15 minutes of a write)
- Multiple small data parts are merged into larger ordered parts
- Merging optimizes the storage layout and, for some engine variants, removes duplicate data
- Merges run asynchronously and do not block writes
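The write-then-merge flow above can be sketched in Python. This is an illustrative model of the idea, not ClickHouse's actual implementation: each insert becomes an immutable sorted part, and a background merge combines parts into one larger ordered part.

```python
# Illustrative model of MergeTree's write-then-merge flow (not real ClickHouse code).

def write_part(rows, key):
    """Each batch insert becomes an immutable part, sorted by primary key."""
    return sorted(rows, key=key)

def merge_parts(parts, key):
    """Background merge: combine small sorted parts into one larger sorted part."""
    merged = []
    for part in parts:
        merged.extend(part)
    merged.sort(key=key)  # a real merge would do a k-way streaming merge instead
    return merged

# Three batch inserts produce three parts; a merge yields one ordered part.
p1 = write_part([(3, 'c'), (1, 'a')], key=lambda r: r[0])
p2 = write_part([(2, 'b')], key=lambda r: r[0])
big = merge_parts([p1, p2], key=lambda r: r[0])
print(big)  # [(1, 'a'), (2, 'b'), (3, 'c')]
```

Because existing parts are never modified, writes stay append-only and fast; the cost of sorting globally is deferred to the asynchronous merge.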
Advantages
- High Write Throughput: Batch writes can reach millions of rows per second
- High Query Efficiency: Ordered storage data is better for range queries
- Good Resource Utilization: Merge operations can be scheduled when system load is low
Actual Application Scenarios
- IoT device data collection (hundreds of thousands of sensor data per second)
- Network traffic log analysis (TB-level log storage and queries)
- E-commerce user behavior analysis (massive click stream data storage)
MergeTree Series Variant Engines
MergeTree series also includes multiple variant engines like ReplacingMergeTree (automatic deduplication), SummingMergeTree (automatic aggregation), providing specialized optimization for different scenarios.
Storage Structure
Create Table
CREATE TABLE mt_table (
    `date` Date,
    `id` UInt8,
    `name` String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(date)
ORDER BY id;
Insert Data
INSERT INTO mt_table VALUES ('2024-07-31', 1, 'wzk');
INSERT INTO mt_table VALUES ('2024-07-30', 2, 'icu');
INSERT INTO mt_table VALUES ('2024-07-29', 3, 'wzkicu');
View Directory
cd /var/lib/clickhouse/data/default/mt_table
ls
Storage Structure
.
├── 202407_1_1_0
│ ├── checksums.txt
│ ├── columns.txt
│ ├── count.txt
│ ├── data.bin
│ ├── data.mrk3
│ ├── default_compression_codec.txt
│ ├── minmax_date.idx
│ ├── partition.dat
│ └── primary.idx
├── 202407_2_2_0
│ ├── checksums.txt
│ ├── columns.txt
│ ├── count.txt
│ ├── data.bin
│ ├── data.mrk3
│ ├── default_compression_codec.txt
│ ├── minmax_date.idx
│ ├── partition.dat
│ └── primary.idx
├── 202407_3_3_0
│ ├── checksums.txt
│ ├── columns.txt
│ ├── count.txt
│ ├── data.bin
│ ├── data.mrk3
│ ├── default_compression_codec.txt
│ ├── minmax_date.idx
│ ├── partition.dat
│ └── primary.idx
├── detached
└── format_version.txt
File Description
- checksums.txt: binary checksum file storing the size and hash of the other files, for fast integrity verification
- columns.txt: plain-text list of column names and types
- count.txt: plain-text row count of the part
- data.bin: compressed data file (LZ4 by default); small Compact parts store all columns in a single data.bin, while Wide parts store each column in its own <column>.bin file
- data.mrk3: mark file mapping index granules to offsets in data.bin; the .mrk3 format supports adaptive index granularity
- partition.dat / minmax_date.idx: the partition key value and the min/max of the partition column, used for partition pruning
- primary.idx: binary primary-index file, a sparse index over the columns declared via ORDER BY or PRIMARY KEY at table creation
Data Partitioning
Data is organized in the form of partition directories, each partition stored independently. This form can effectively skip useless data files when querying.
Partition Rules
Partition key values are mapped to a partition ID, and rows are routed to partitions by that ID. There are currently four ID-generation rules, depending on the partition key's data type:
- No partition key specified: all data goes into a single partition with ID all
- Integer key: the decimal string form of the value
- Date key (e.g. toYYYYMM(date)): the formatted date string, such as 202407
- Any other type: a hash of the key value
When data is written, each row falls into the partition matching its partition ID.
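The four rules can be sketched as a small Python function. This is a hedged approximation: the function name is illustrative, and for "other types" ClickHouse actually uses a 128-bit hash (SipHash), simplified to md5 here.

```python
# Sketch of the four partition-ID rules (illustrative, not ClickHouse internals).
import hashlib
from datetime import date

def partition_id(value):
    if value is None:                      # rule 1: no PARTITION BY clause
        return "all"
    if isinstance(value, int):             # rule 2: integer key -> decimal string
        return str(value)
    if isinstance(value, date):            # rule 3: toYYYYMM(date) -> "YYYYMM"
        return f"{value.year:04d}{value.month:02d}"
    # rule 4: other types -> a hash of the value (ClickHouse uses a 128-bit
    # SipHash; md5 is an assumption for the sketch)
    return hashlib.md5(str(value).encode()).hexdigest()

print(partition_id(date(2024, 7, 31)))  # 202407
print(partition_id(None))               # all
```

This is why the directories in the listing above are all named `202407_*`: every row inserted in the example has a July 2024 date, so they share one partition ID.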
Partition Directory Generation
- BlockNum: an integer counter within the table, starting from 1 and incremented each time a new part directory is created
- MinBlockNum: the smallest block number in the part
- MaxBlockNum: the largest block number in the part
- For a freshly written part, MinBlockNum and MaxBlockNum are equal
Partition Directory Merge
MergeTree part directories are created at write time; different batches written to the same partition produce separate directories, which are merged at some later point (typically within 10-15 minutes of the write). Old directories are deleted after a delay (8 minutes by default).
Naming rules when multiple directories of the same partition are merged:
- MinBlockNum: the smallest MinBlockNum among the merged parts
- MaxBlockNum: the largest MaxBlockNum among the merged parts
- Level: the largest Level among the merged parts, plus 1
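The naming rules above can be applied mechanically. A minimal sketch, using the part-directory naming convention `<PartitionID>_<MinBlockNum>_<MaxBlockNum>_<Level>`:

```python
# Sketch of the directory-name rule after merging parts of one partition.
# A part directory is named <PartitionID>_<MinBlockNum>_<MaxBlockNum>_<Level>.

def merged_part_name(parts):
    """parts: list of (partition_id, min_block, max_block, level) tuples."""
    pid = parts[0][0]                        # all inputs share one partition
    min_block = min(p[1] for p in parts)     # smallest MinBlockNum
    max_block = max(p[2] for p in parts)     # largest MaxBlockNum
    level = max(p[3] for p in parts) + 1     # largest Level + 1
    return f"{pid}_{min_block}_{max_block}_{level}"

# Merging the three single-insert parts from the directory listing above:
print(merged_part_name([("202407", 1, 1, 0),
                        ("202407", 2, 2, 0),
                        ("202407", 3, 3, 0)]))
# 202407_1_3_1
```

So after the background merge, the three directories `202407_1_1_0`, `202407_2_2_0`, and `202407_3_3_0` become a single `202407_1_3_1`.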
Primary Index
Sparse Index
- File: primary.idx
- Concept: MergeTree's primary key is defined with PRIMARY KEY; once defined, MergeTree samples one index entry every index_granularity rows (8192 by default) and saves them to primary.idx. This sampling is what makes it a sparse index.
- Shorthand: when PRIMARY KEY is omitted, the ORDER BY columns serve as the primary key
Primary index file uses sparse index:
- Dense index: Each index row corresponds to one row of actual data
- Sparse index: Each index row corresponds to a segment of data records (default index granularity is 8192)
Because a sparse index is small, the data in primary.idx usually stays resident in memory, making lookups fast.
Generation Rules
Because the index is sparse, MergeTree writes one record to primary.idx for every index_granularity rows, taking the values of the declared primary-key columns at each granule boundary.
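The generation rule amounts to sampling the sorted key column at granule boundaries. A minimal sketch (using granularity 5 instead of the default 8192, for readability):

```python
# Sketch: build a sparse primary index by sampling the key every
# index_granularity rows (8192 by default in ClickHouse; 5 here for brevity).

def build_sparse_index(sorted_keys, index_granularity):
    """Record the first primary-key value of every granule."""
    return [sorted_keys[i] for i in range(0, len(sorted_keys), index_granularity)]

keys = list(range(20))             # 20 rows, primary key values 0..19
marks = build_sparse_index(keys, 5)
print(marks)  # [0, 5, 10, 15] -> one index mark per 5-row granule
```

A dense index over the same 20 rows would need 20 entries; the sparse one needs only 4, which is why primary.idx stays small enough to live in memory.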
Query Process
How does the index work? A lookup against primary.idx proceeds as follows:
- MarkRange: the complete data is divided into segments of index_granularity rows each; one such segment is a MarkRange
- Each MarkRange maps to a numbered index mark
A small example:
- 200 rows of data
- index_granularity of 5
- primary key Id is an integer with values starting from 0
200 rows / 5 = 40 MarkRanges in total.
Suppose the query is WHERE Id = 3:
- Step 1: express the condition as the interval [3, 3]
- Step 2: intersect [3, 3] with the key intervals of the MarkRanges, recursively splitting ranges still wider than 8 marks and discarding ranges that do not intersect
- Step 3: merge the surviving MarkRanges into contiguous row ranges to read
In ClickHouse, the index columns of a MergeTree table are specified with the ORDER BY clause at table creation.
Note: ClickHouse's sparse index differs from Kafka's in that users can freely combine multiple columns. Be careful not to add too many index columns, which makes the index overly sparse and increases storage and lookup costs. Columns with low cardinality (low selectivity) are also poor index candidates: many consecutive marks will share the same value, so the index loses its discriminating power.
Skip Index
- index_granularity defines the row granularity of the data (one granule)
- GRANULARITY defines how many index_granularity granules one skip-index mark summarizes, i.e. how much data a single skip-index row can rule out
Available Types
- minmax: stores the extremes of the specified expression (if the expression is a tuple, the extremes of each element); used to skip data blocks much like the primary key
- set(max_rows): stores the distinct values of the specified expression (at most max_rows; max_rows = 0 means unlimited); used to check whether a WHERE expression can be satisfied by a data block
- ngrambf_v1: stores a Bloom filter of all n-grams in the data block; only usable on strings, to speed up equals, LIKE and IN expressions
- tokenbf_v1: similar to ngrambf_v1, but instead of storing all fragments of a fixed length it stores only tokens separated by non-alphanumeric characters
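The minmax type, being the simplest, illustrates how a skip index prunes: each skip-index mark stores the min/max over GRANULARITY consecutive granules, and any block whose range cannot intersect the query predicate is skipped. A minimal sketch with illustrative names:

```python
# Sketch of a minmax skip index: each skip-index mark summarizes the min/max
# of GRANULARITY consecutive granules, letting whole blocks be skipped.

def build_minmax(values, index_granularity, granularity):
    block = index_granularity * granularity        # rows summarized per skip mark
    return [(min(values[i:i + block]), max(values[i:i + block]))
            for i in range(0, len(values), block)]

def blocks_to_read(minmax, lo, hi):
    """Indexes of blocks whose [min, max] range intersects the query [lo, hi]."""
    return [i for i, (mn, mx) in enumerate(minmax) if mn <= hi and mx >= lo]

vals = list(range(100))                 # a monotonically increasing column
mm = build_minmax(vals, 5, 4)           # 5 blocks of 20 rows each
print(mm)                               # [(0, 19), (20, 39), (40, 59), ...]
print(blocks_to_read(mm, 25, 30))       # [1] -> four of five blocks skipped
```

As with the primary index, this works best when the column's values cluster within blocks; on a randomly shuffled column every block's min/max spans the whole range and nothing can be skipped.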
Minimum Runnable Example
# 1) Start a local ClickHouse (Docker)
docker run -d --name ch -p 8123:8123 -p 9000:9000 clickhouse/clickhouse-server:latest
# 2) Create table (partition by month, sort by id)
cat > mt_schema.sql <<'SQL'
CREATE DATABASE IF NOT EXISTS demo;
DROP TABLE IF EXISTS demo.mt;
CREATE TABLE demo.mt (
date Date,
id UInt32,
name String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(date)
ORDER BY (id)
SETTINGS index_granularity = 8192;
SQL
docker exec -i ch clickhouse-client --multiquery < mt_schema.sql
# 3) Insert sample data (3 batches)
cat > mt_insert.sh <<'SH'
#!/usr/bin/env bash
for d in 2024-07-29 2024-07-30 2024-07-31; do
docker exec -i ch clickhouse-client -q \
"INSERT INTO demo.mt SELECT '$d', number, concat('name_', toString(number)) FROM numbers(100000)"
done
SH
bash mt_insert.sh
# 4) Verify: see 3 parts, range query hits
docker exec -i ch clickhouse-client -q "
SELECT partition, name, rows, marks
FROM system.parts
WHERE database='demo' AND table='mt' AND active
ORDER BY name;"
docker exec -i ch clickhouse-client -q "
SELECT count() FROM demo.mt WHERE id BETWEEN 1000 AND 2000;"
Rollback Script (Example)
-- 1) First take a frozen snapshot (strongly recommended)
ALTER TABLE demo.mt FREEZE WITH NAME 'pre_optimize';
-- 2) If the merge result is unsatisfactory: DETACH the partition, then restore from the snapshot
ALTER TABLE demo.mt DETACH PARTITION 202407; -- moves the parts into the detached directory
-- Manually copy /shadow/{UUID}/data/demo/mt/202407_* back, then
ALTER TABLE demo.mt ATTACH PARTITION 202407;
FAQ
- PRIMARY KEY vs ORDER BY: in MergeTree they are often equivalent; when PRIMARY KEY is omitted, ORDER BY defines the sparse primary index
- How to choose index columns: prefer high-selectivity columns commonly used for filtering and sorting; avoid over-combining columns, which dilutes the index
- Whether to change index_granularity: keep the default 8192 in most scenarios; only evaluate alternatives for extreme narrow scans or very high concurrency
- Compact vs Wide parts: small parts are usually Compact and become Wide once they cross the size/row thresholds during merges; having both coexist is normal
- Can OPTIMIZE … FINAL replace background merges: no; FINAL forces an expensive merge and should be reserved for specific windows or maintenance