Big Data 137 - ClickHouse MergeTree Practical Guide
TL;DR
- Scenario: time-series and log tables with heavy writes and slow queries, where you want MergeTree's write throughput and range-query advantages
- Conclusion: following the SOP in this article, you can stand up a minimal PARTITION BY + ORDER BY + index_granularity loop in about 10 minutes
- Output: a detailed walkthrough of the ClickHouse MergeTree storage structure and its principles
MergeTree
MergeTree and the other engines in its family are the most powerful table engines in ClickHouse. The MergeTree series is ClickHouse's core storage engine, designed for high-throughput writes and efficient queries, and is especially well suited to time-series data, log analysis, and other big-data scenarios.
MergeTree Engine Series Basic Concept
When you have massive amounts of data to insert into a table, writing it in batches as data parts is far more efficient than traditional row-by-row inserts. How it works:

Data write phase:
- Data is written in batches as data parts
- Each data part is sorted internally by primary key
- Writes are append-only; existing data is never modified

Background merge process:
- The system triggers merges automatically in the background (typically within 10-15 minutes of a write)
- Multiple small data parts are merged into larger ordered parts
- Merging optimizes the storage layout and, for some engine variants, removes duplicate data
- Merges run asynchronously and do not block writes
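The write-then-merge flow above can be sketched in Python. This is an illustrative model of the idea, not ClickHouse's actual implementation: each insert becomes an immutable sorted part, and a background merge combines parts into one larger ordered part.

```python
# Illustrative model of MergeTree's write-then-merge flow (not real ClickHouse code).

def write_part(rows, key):
    """Each batch insert becomes an immutable part, sorted by primary key."""
    return sorted(rows, key=key)

def merge_parts(parts, key):
    """Background merge: combine small sorted parts into one larger sorted part."""
    merged = []
    for part in parts:
        merged.extend(part)
    merged.sort(key=key)  # a real merge would do a k-way streaming merge instead
    return merged

# Three batch inserts produce three parts; a merge yields one ordered part.
p1 = write_part([(3, 'c'), (1, 'a')], key=lambda r: r[0])
p2 = write_part([(2, 'b')], key=lambda r: r[0])
big = merge_parts([p1, p2], key=lambda r: r[0])
print(big)  # [(1, 'a'), (2, 'b'), (3, 'c')]
```

Because existing parts are never modified, writes stay append-only and fast; the cost of sorting globally is deferred to the asynchronous merge.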
Advantages
- High Write Throughput: Batch writes can reach millions of rows per second
- High Query Efficiency: Ordered storage data is better for range queries
- Good Resource Utilization: Merge operations can be scheduled when system load is low
Actual Application Scenarios
- IoT device data collection (hundreds of thousands of sensor data per second)
- Network traffic log analysis (TB-level log storage and queries)
- E-commerce user behavior analysis (massive click stream data storage)
MergeTree Series Variant Engines
MergeTree series also includes multiple variant engines like ReplacingMergeTree (automatic deduplication), SummingMergeTree (automatic aggregation), providing specialized optimization for different scenarios.
Storage Structure
Create Table
CREATE TABLE mt_table (
    `date` Date,
    `id` UInt8,
    `name` String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(date)
ORDER BY id;
Insert Data
INSERT INTO mt_table VALUES ('2024-07-31', 1, 'wzk');
INSERT INTO mt_table VALUES ('2024-07-30', 2, 'icu');
INSERT INTO mt_table VALUES ('2024-07-29', 3, 'wzkicu');
View Directory
cd /var/lib/clickhouse/data/default/mt_table
ls
Storage Structure
.
├── 202407_1_1_0
│ ├── checksums.txt
│ ├── columns.txt
│ ├── count.txt
│ ├── data.bin
│ ├── data.mrk3
│ ├── default_compression_codec.txt
│ ├── minmax_date.idx
│ ├── partition.dat
│ └── primary.idx
├── 202407_2_2_0
│ ├── checksums.txt
│ ├── columns.txt
│ ├── count.txt
│ ├── data.bin
│ ├── data.mrk3
│ ├── default_compression_codec.txt
│ ├── minmax_date.idx
│ ├── partition.dat
│ └── primary.idx
├── 202407_3_3_0
│ ├── checksums.txt
│ ├── columns.txt
│ ├── count.txt
│ ├── data.bin
│ ├── data.mrk3
│ ├── default_compression_codec.txt
│ ├── minmax_date.idx
│ ├── partition.dat
│ └── primary.idx
├── detached
└── format_version.txt
File Description
- checksums.txt: binary checksum file storing the size and hash of the other files, for fast integrity verification
- columns.txt: plain-text list of column names and types
- count.txt: plain-text row count of the part
- data.bin: compressed data file (LZ4 by default); small Compact parts store all columns in a single data.bin, while Wide parts store each column in its own <column>.bin file
- data.mrk3: mark file mapping index granules to offsets in data.bin; the .mrk3 format supports adaptive index granularity
- partition.dat / minmax_date.idx: the partition key value and the min/max of the partition column, used for partition pruning
- primary.idx: binary primary-index file, a sparse index over the columns declared via ORDER BY or PRIMARY KEY at table creation
Data Partitioning
Data is organized in the form of partition directories, each partition stored independently. This form can effectively skip useless data files when querying.
Partition Rules
Partition key values are mapped to a partition ID, and rows are routed to partitions by that ID. There are currently four ID-generation rules, depending on the partition key's data type:
- No partition key specified: all data goes into a single partition with ID all
- Integer key: the decimal string form of the value
- Date key (e.g. toYYYYMM(date)): the formatted date string, such as 202407
- Any other type: a hash of the key value
When data is written, each row falls into the partition matching its partition ID.
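The four rules can be sketched as a small Python function. This is a hedged approximation: the function name is illustrative, and for "other types" ClickHouse actually uses a 128-bit hash (SipHash), simplified to md5 here.

```python
# Sketch of the four partition-ID rules (illustrative, not ClickHouse internals).
import hashlib
from datetime import date

def partition_id(value):
    if value is None:                      # rule 1: no PARTITION BY clause
        return "all"
    if isinstance(value, int):             # rule 2: integer key -> decimal string
        return str(value)
    if isinstance(value, date):            # rule 3: toYYYYMM(date) -> "YYYYMM"
        return f"{value.year:04d}{value.month:02d}"
    # rule 4: other types -> a hash of the value (ClickHouse uses a 128-bit
    # SipHash; md5 is an assumption for the sketch)
    return hashlib.md5(str(value).encode()).hexdigest()

print(partition_id(date(2024, 7, 31)))  # 202407
print(partition_id(None))               # all
```

This is why the directories in the listing above are all named `202407_*`: every row inserted in the example has a July 2024 date, so they share one partition ID.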
Partition Directory Generation
- BlockNum: an integer counter within the table, starting from 1 and incremented each time a new part directory is created
- MinBlockNum: the smallest block number in the part
- MaxBlockNum: the largest block number in the part
- For a freshly written part, MinBlockNum and MaxBlockNum are equal
Partition Directory Merge
MergeTree part directories are created at write time; different batches written to the same partition produce separate directories, which are merged at some later point (typically within 10-15 minutes of the write). Old directories are deleted after a delay (8 minutes by default).
Naming rules when multiple directories of the same partition are merged:
- MinBlockNum: the smallest MinBlockNum among the merged parts
- MaxBlockNum: the largest MaxBlockNum among the merged parts
- Level: the largest Level among the merged parts, plus 1
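The naming rules above can be applied mechanically. A minimal sketch, using the part-directory naming convention `<PartitionID>_<MinBlockNum>_<MaxBlockNum>_<Level>`:

```python
# Sketch of the directory-name rule after merging parts of one partition.
# A part directory is named <PartitionID>_<MinBlockNum>_<MaxBlockNum>_<Level>.

def merged_part_name(parts):
    """parts: list of (partition_id, min_block, max_block, level) tuples."""
    pid = parts[0][0]                        # all inputs share one partition
    min_block = min(p[1] for p in parts)     # smallest MinBlockNum
    max_block = max(p[2] for p in parts)     # largest MaxBlockNum
    level = max(p[3] for p in parts) + 1     # largest Level + 1
    return f"{pid}_{min_block}_{max_block}_{level}"

# Merging the three single-insert parts from the directory listing above:
print(merged_part_name([("202407", 1, 1, 0),
                        ("202407", 2, 2, 0),
                        ("202407", 3, 3, 0)]))
# 202407_1_3_1
```

So after the background merge, the three directories `202407_1_1_0`, `202407_2_2_0`, and `202407_3_3_0` become a single `202407_1_3_1`.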
Primary Index
Sparse Index
- File: primary.idx
- Concept: MergeTree's primary key is defined with PRIMARY KEY; once defined, MergeTree samples one index entry every index_granularity rows (8192 by default) and saves them to primary.idx. This sampling is what makes it a sparse index.
- Shorthand: when PRIMARY KEY is omitted, the ORDER BY columns serve as the primary key
Primary index file uses sparse index:
- Dense index: Each index row corresponds to one row of actual data
- Sparse index: Each index row corresponds to a segment of data records (default index granularity is 8192)
Because a sparse index is small, the data in primary.idx usually stays resident in memory, making lookups fast.
Generation Rules
Because the index is sparse, MergeTree writes one record to primary.idx for every index_granularity rows, taking the values of the declared primary-key columns at each granule boundary.
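The generation rule amounts to sampling the sorted key column at granule boundaries. A minimal sketch (using granularity 5 instead of the default 8192, for readability):

```python
# Sketch: build a sparse primary index by sampling the key every
# index_granularity rows (8192 by default in ClickHouse; 5 here for brevity).

def build_sparse_index(sorted_keys, index_granularity):
    """Record the first primary-key value of every granule."""
    return [sorted_keys[i] for i in range(0, len(sorted_keys), index_granularity)]

keys = list(range(20))             # 20 rows, primary key values 0..19
marks = build_sparse_index(keys, 5)
print(marks)  # [0, 5, 10, 15] -> one index mark per 5-row granule
```

A dense index over the same 20 rows would need 20 entries; the sparse one needs only 4, which is why primary.idx stays small enough to live in memory.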
Query Process
How does the index work? A lookup against primary.idx proceeds as follows:
- MarkRange: the complete data is divided into segments of index_granularity rows each; one such segment is a MarkRange
- Each MarkRange maps to a numbered index mark
A small example:
- 200 rows of data
- index_granularity of 5
- primary key Id is an integer with values starting from 0
200 rows / 5 = 40 MarkRanges in total.
Suppose the query is WHERE Id = 3:
- Step 1: express the condition as the interval [3, 3]
- Step 2: intersect [3, 3] with the key intervals of the MarkRanges, recursively splitting ranges still wider than 8 marks and discarding ranges that do not intersect
- Step 3: merge the surviving MarkRanges into contiguous row ranges to read
In ClickHouse, the index columns of a MergeTree table are specified with the ORDER BY clause at table creation.
Note: ClickHouse's sparse index differs from Kafka's in that users can freely combine multiple columns. Be careful not to add too many index columns, which makes the index overly sparse and increases storage and lookup costs. Columns with low cardinality (low selectivity) are also poor index candidates: many consecutive marks will share the same value, so the index loses its discriminating power.
Skip Index
- index_granularity defines the row granularity of the data (one granule)
- GRANULARITY defines how many index_granularity granules one skip-index mark summarizes, i.e. how much data a single skip-index row can rule out
Available Types
- minmax: stores the extremes of the specified expression (if the expression is a tuple, the extremes of each element); used to skip data blocks much like the primary key
- set(max_rows): stores the distinct values of the specified expression (at most max_rows; max_rows = 0 means unlimited); used to check whether a WHERE expression can be satisfied by a data block
- ngrambf_v1: stores a Bloom filter of all n-grams in the data block; only usable on strings, to speed up equals, LIKE and IN expressions
- tokenbf_v1: similar to ngrambf_v1, but instead of storing all fragments of a fixed length it stores only tokens separated by non-alphanumeric characters
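The minmax type, being the simplest, illustrates how a skip index prunes: each skip-index mark stores the min/max over GRANULARITY consecutive granules, and any block whose range cannot intersect the query predicate is skipped. A minimal sketch with illustrative names:

```python
# Sketch of a minmax skip index: each skip-index mark summarizes the min/max
# of GRANULARITY consecutive granules, letting whole blocks be skipped.

def build_minmax(values, index_granularity, granularity):
    block = index_granularity * granularity        # rows summarized per skip mark
    return [(min(values[i:i + block]), max(values[i:i + block]))
            for i in range(0, len(values), block)]

def blocks_to_read(minmax, lo, hi):
    """Indexes of blocks whose [min, max] range intersects the query [lo, hi]."""
    return [i for i, (mn, mx) in enumerate(minmax) if mn <= hi and mx >= lo]

vals = list(range(100))                 # a monotonically increasing column
mm = build_minmax(vals, 5, 4)           # 5 blocks of 20 rows each
print(mm)                               # [(0, 19), (20, 39), (40, 59), ...]
print(blocks_to_read(mm, 25, 30))       # [1] -> four of five blocks skipped
```

As with the primary index, this works best when the column's values cluster within blocks; on a randomly shuffled column every block's min/max spans the whole range and nothing can be skipped.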
Minimum Runnable Example
# 1) Start a local ClickHouse (Docker)
docker run -d --name ch -p 8123:8123 -p 9000:9000 clickhouse/clickhouse-server:latest
# 2) Create table (partition by month, sort by id)
cat > mt_schema.sql <<'SQL'
CREATE DATABASE IF NOT EXISTS demo;
DROP TABLE IF EXISTS demo.mt;
CREATE TABLE demo.mt (
date Date,
id UInt32,
name String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(date)
ORDER BY (id)
SETTINGS index_granularity = 8192;
SQL
docker exec -i ch clickhouse-client --multiquery < mt_schema.sql
# 3) Insert sample data (3 batches)
cat > mt_insert.sh <<'SH'
#!/usr/bin/env bash
for d in 2024-07-29 2024-07-30 2024-07-31; do
docker exec -i ch clickhouse-client -q \
"INSERT INTO demo.mt SELECT '$d', number, concat('name_', toString(number)) FROM numbers(100000)"
done
SH
bash mt_insert.sh
# 4) Verify: see 3 parts, range query hits
docker exec -i ch clickhouse-client -q "
SELECT partition, name, rows, marks
FROM system.parts
WHERE database='demo' AND table='mt' AND active
ORDER BY name;"
docker exec -i ch clickhouse-client -q "
SELECT count() FROM demo.mt WHERE id BETWEEN 1000 AND 2000;"
Rollback Script (Example)
-- 1) First take a frozen snapshot (strongly recommended)
ALTER TABLE demo.mt FREEZE WITH NAME 'pre_optimize';
-- 2) If the merge result is unsatisfactory: DETACH the partition, then restore from the snapshot
ALTER TABLE demo.mt DETACH PARTITION 202407; -- moves the parts into the detached directory
-- Manually copy /shadow/{UUID}/data/demo/mt/202407_* back, then
ALTER TABLE demo.mt ATTACH PARTITION 202407;
FAQ
- PRIMARY KEY vs ORDER BY: in MergeTree they are often equivalent; when PRIMARY KEY is omitted, ORDER BY defines the sparse primary index
- How to choose index columns: prefer high-selectivity columns commonly used for filtering and sorting; avoid over-combining columns, which dilutes the index
- Whether to change index_granularity: keep the default 8192 in most scenarios; only evaluate alternatives for extreme narrow scans or very high concurrency
- Compact vs Wide parts: small parts are usually Compact and become Wide once they cross the size/row thresholds during merges; having both coexist is normal
- Can OPTIMIZE … FINAL replace background merges: no; FINAL forces an expensive merge and should be reserved for specific windows or maintenance