ClickHouse MergeTree Partition/TTL, Materialized View, AL...

Basic Introduction

ClickHouse is a columnar database for OLAP (Online Analytical Processing), favored in big data analysis for its high-speed data processing. ClickHouse SQL syntax is similar to standard SQL, but has special extensions due to its analytical focus. ClickHouse doesn’t support direct DELETE or UPDATE by default, but can indirectly clean data through partition management and merge mechanism. ClickHouse provides many functions designed for efficient analysis, including rich aggregate functions.

Basic SQL Syntax

ClickHouse SQL syntax is similar to standard SQL, but has special extensions due to analytical focus.

When creating table:

CREATE TABLE table_name (
    column1 DataType,
    column2 DataType,
    ...
) ENGINE = MergeTree()
ORDER BY (primary_key_columns);

ENGINE: Table engine, most commonly used is MergeTree series.
ORDER BY: Must specify sort key, supports efficient queries on large datasets.
PARTITION BY: Partition by column (optional).
SAMPLE BY: Used for sampling queries on large data volumes (optional).

When deleting or cleaning table data: ClickHouse doesn’t support direct DELETE or UPDATE by default, but can indirectly clean data through partition management and merge mechanism.

ALTER TABLE table_name DROP PARTITION partition_expr;

Special Features

Aggregate Functions

ClickHouse provides rich aggregate functions like sum(), avg(), min(), max(), count(). Also has special aggregate functions:

SELECT uniqExact(column) FROM table_name; -- Exact distinct count
SELECT quantiles(0.5, 0.9)(column) FROM table_name; -- Quantile calculation

Window Functions

ClickHouse supports window functions, but syntax differs slightly. Common window functions: row_number(), rank():

SELECT column, rowNumber() OVER (PARTITION BY partition_column ORDER BY sort_column)
FROM table_name;

Array and Nested Types

ClickHouse supports array and nested types, suitable for handling complex data structures:

SELECT arrayJoin(array) FROM table_name;

arrayJoin: Expands array to multiple rows

MergeTree Engine

MergeTree is one of the most commonly used engines in ClickHouse, with sorting, indexing and partitioning features, efficiently handling massive data.

ORDER BY: Defines primary key, data is sorted by these fields.
PRIMARY KEY: Can be same as ORDER BY, for fast positioning.
PARTITION BY: Used to logically shard data, reduce query scope.
TTL: Sets data expiration time, automatically cleans historical data.

Current Environment

My ClickHouse cluster environment:

h121.wzk.icu
h122.wzk.icu
h123.wzk.icu

Establish Connection

Connect to any node randomly:

clickhouse-client -m --host h121.wzk.icu --port 9001 --user default --password clickhouse@wzk.icu

Create Database

CREATE DATABASE mydatabase;

View Databases

SHOW DATABASES;

Create Table

# Method 1
CREATE TABLE my_table(
  title String,
  url String,
  eventTime DateTime
) ENGINE = Memory;

# Method 2
CREATE TABLE mydatabase.my_table(
  title String,
  url String,
  eventTime DateTime
) ENGINE = Memory;

# Method 3
CREATE TABLE mydatabase.my_table_2(
  title String,
  url String,
  eventTime DateTime
) ENGINE = Memory AS SELECT * FROM mydatabase.my_table;

View Table Structure

DESC my_table;

Insert Data

INSERT INTO my_table VALUES ('wzk', '123', now());

Temporary Table

CREATE TABLE tmp_v1 (
  title String,
  create_time DateTime
) ENGINE = Memory;

If temporary table has same name as normal table, temporary table takes priority. Temporary table engine can only be Memory, data is temporary, lost on disconnect. More often in ClickHouse, it’s a carrier for data spreading across clusters.

Partitioned Table

Create Table

CREATE TABLE partition_v1 (
  `id` String,
  `url` String,
  `eventTime` Date
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(eventTime)
ORDER BY id;

Only MergeTree family table engines support partitioned tables. Using partitioned tables enables targeted queries and reduces query scope. Partition field should not be too granular.

Insert Data

INSERT INTO partition_v1 (id, url, eventTime) VALUES
('1', 'http://example.com/page1', '2024-01-01'),
('2', 'http://example.com/page2', '2024-01-15'),
('3', 'http://example.com/page3', '2024-02-01'),
('4', 'http://example.com/page4', '2024-02-15'),
('5', 'http://example.com/page5', '2024-03-01'),
('6', 'http://example.com/page6', '2024-03-15');

Query Data

SELECT * FROM partition_v1;

View Partitions

SELECT table, partition, path FROM system.parts WHERE table = 'partition_v1';

View Tables

Normal View: Doesn’t store data, just a SELECT query mapping layer, simplifies queries
Materialized View: Stores data, when source table receives data, materialized view also updates synchronously
POPULATE Modifier: Determines whether to sync source table data to materialized view during creation

Table Basic Operations

Only MergeTree, Merge, Distribution table engines support ALTER operations!!!

Add Column

ALTER TABLE partition_v1 ADD COLUMN os String default 'mac';
ALTER TABLE partition_v1 ADD COLUMN ip String after id;
DESC partition_v1;

Modify Type

Note: Types need to be compatible

ALTER TABLE partition_v1 modify column ip IPv4;
DESC partition_v1;

Modify Comment

ALTER TABLE partition_v1 COMMENT COLUMN id 'Primary Key ID';
DESC partition_v1;

Delete Column

ALTER TABLE partition_v1 DROP COLUMN url;
DESC partition_v1;

Note: Deleting column deletes all data under that column.

Move Table

rename TABLE default.partition_v1 to mydatabase.partition_v1;
USE mydatabase;
SHOW TABLES;

Partition Operations

View Partition

SELECT partition_id, name, table, database FROM system.parts where table = 'partition_v1';

Delete Partition

ALTER TABLE partition_v1 DROP PARTITION 202401;
SELECT partition_id, name, table, database FROM system.parts where table = 'partition_v1';

Copy Partition

ALTER TABLE partition_v2 replace partition 202402 FROM partition_v1;

Clear Partition

ALTER TABLE partition_v1 CLEAR COLUMN ip in partition 202402;

Clears ip column (sets to default value)
Clear operation doesn’t delete records, just sets specified column to default value (like 0 or NULL, depending on column’s default setting)

Unmount Partition

ALTER TABLE partition_v1 DETACH partition 202402;
SELECT partition_id, name, table, database FROM system.parts where table = 'partition_v1';

Remount Partition

ALTER TABLE partition_v1 ATTACH partition 202402;
SELECT partition_id, name, table, database FROM system.parts where table = 'partition_v1';