Basic Introduction
ClickHouse is a columnar database for OLAP (Online Analytical Processing), favored in big data analysis for its high-speed data processing. ClickHouse SQL syntax is similar to standard SQL, but has special extensions due to its analytical focus. ClickHouse doesn’t support direct DELETE or UPDATE by default, but can indirectly clean data through partition management and merge mechanism. ClickHouse provides many functions designed for efficient analysis, including rich aggregate functions.
Basic SQL Syntax
ClickHouse SQL syntax is similar to standard SQL, but has special extensions due to analytical focus.
When creating table:
CREATE TABLE table_name (
column1 DataType,
column2 DataType,
...
) ENGINE = MergeTree()
ORDER BY (primary_key_columns);
- ENGINE: Table engine, most commonly used is MergeTree series.
- ORDER BY: Must specify sort key, supports efficient queries on large datasets.
- PARTITION BY: Partition by column (optional).
- SAMPLE BY: Used for sampling queries on large data volumes (optional).
When deleting or cleaning table data: ClickHouse doesn’t support direct DELETE or UPDATE by default, but can indirectly clean data through partition management and merge mechanism.
ALTER TABLE table_name DROP PARTITION partition_expr;
Special Features
Aggregate Functions
ClickHouse provides rich aggregate functions like sum(), avg(), min(), max(), count(). Also has special aggregate functions:
SELECT uniqExact(column) FROM table_name; -- Exact distinct count
SELECT quantiles(0.5, 0.9)(column) FROM table_name; -- Quantile calculation
Window Functions
ClickHouse supports window functions, but syntax differs slightly. Common window functions: row_number(), rank():
SELECT column, rowNumber() OVER (PARTITION BY partition_column ORDER BY sort_column)
FROM table_name;
Array and Nested Types
ClickHouse supports array and nested types, suitable for handling complex data structures:
SELECT arrayJoin(array) FROM table_name;
- arrayJoin: Expands array to multiple rows
MergeTree Engine
MergeTree is one of the most commonly used engines in ClickHouse, with sorting, indexing and partitioning features, efficiently handling massive data.
- ORDER BY: Defines primary key, data is sorted by these fields.
- PRIMARY KEY: Can be same as ORDER BY, for fast positioning.
- PARTITION BY: Used to logically shard data, reduce query scope.
- TTL: Sets data expiration time, automatically cleans historical data.
Current Environment
My ClickHouse cluster environment:
- h121.wzk.icu
- h122.wzk.icu
- h123.wzk.icu
Establish Connection
Connect to any node randomly:
clickhouse-client -m --host h121.wzk.icu --port 9001 --user default --password clickhouse@wzk.icu
Create Database
CREATE DATABASE mydatabase;
View Databases
SHOW DATABASES;
Create Table
# Method 1
CREATE TABLE my_table(
title String,
url String,
eventTime DateTime
) ENGINE = Memory;
# Method 2
CREATE TABLE mydatabase.my_table(
title String,
url String,
eventTime DateTime
) ENGINE = Memory;
# Method 3
CREATE TABLE mydatabase.my_table_2(
title String,
url String,
eventTime DateTime
) ENGINE = Memory AS SELECT * FROM mydatabase.my_table;
View Table Structure
DESC my_table;
Insert Data
INSERT INTO my_table VALUES ('wzk', '123', now());
Temporary Table
CREATE TABLE tmp_v1 (
title String,
create_time DateTime
) ENGINE = Memory;
If temporary table has same name as normal table, temporary table takes priority. Temporary table engine can only be Memory, data is temporary, lost on disconnect. More often in ClickHouse, it’s a carrier for data spreading across clusters.
Partitioned Table
Create Table
CREATE TABLE partition_v1 (
`id` String,
`url` String,
`eventTime` Date
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(eventTime)
ORDER BY id;
Only MergeTree family table engines support partitioned tables. Using partitioned tables enables targeted queries and reduces query scope. Partition field should not be too granular.
Insert Data
INSERT INTO partition_v1 (id, url, eventTime) VALUES
('1', 'http://example.com/page1', '2024-01-01'),
('2', 'http://example.com/page2', '2024-01-15'),
('3', 'http://example.com/page3', '2024-02-01'),
('4', 'http://example.com/page4', '2024-02-15'),
('5', 'http://example.com/page5', '2024-03-01'),
('6', 'http://example.com/page6', '2024-03-15');
Query Data
SELECT * FROM partition_v1;
View Partitions
SELECT table, partition, path FROM system.parts WHERE table = 'partition_v1';
View Tables
- Normal View: Doesn’t store data, just a SELECT query mapping layer, simplifies queries
- Materialized View: Stores data, when source table receives data, materialized view also updates synchronously
- POPULATE Modifier: Determines whether to sync source table data to materialized view during creation
Table Basic Operations
Only MergeTree, Merge, Distribution table engines support ALTER operations!!!
Add Column
ALTER TABLE partition_v1 ADD COLUMN os String default 'mac';
ALTER TABLE partition_v1 ADD COLUMN ip String after id;
DESC partition_v1;
Modify Type
Note: Types need to be compatible
ALTER TABLE partition_v1 modify column ip IPv4;
DESC partition_v1;
Modify Comment
ALTER TABLE partition_v1 COMMENT COLUMN id 'Primary Key ID';
DESC partition_v1;
Delete Column
ALTER TABLE partition_v1 DROP COLUMN url;
DESC partition_v1;
Note: Deleting column deletes all data under that column.
Move Table
rename TABLE default.partition_v1 to mydatabase.partition_v1;
USE mydatabase;
SHOW TABLES;
Partition Operations
View Partition
SELECT partition_id, name, table, database FROM system.parts where table = 'partition_v1';
Delete Partition
ALTER TABLE partition_v1 DROP PARTITION 202401;
SELECT partition_id, name, table, database FROM system.parts where table = 'partition_v1';
Copy Partition
ALTER TABLE partition_v2 replace partition 202402 FROM partition_v1;
Clear Partition
ALTER TABLE partition_v1 CLEAR COLUMN ip in partition 202402;
- Clears ip column (sets to default value)
- Clear operation doesn’t delete records, just sets specified column to default value (like 0 or NULL, depending on column’s default setting)
Unmount Partition
ALTER TABLE partition_v1 DETACH partition 202402;
SELECT partition_id, name, table, database FROM system.parts where table = 'partition_v1';
Remount Partition
ALTER TABLE partition_v1 ATTACH partition 202402;
SELECT partition_id, name, table, database FROM system.parts where table = 'partition_v1';