Replica Introduction

ReplicatedMergeTree ZooKeeper: Implements communication between multiple instances.

Replica Characteristics

As the primary carrier for data replicas, ReplicatedMergeTree has some design considerations:

  • ZooKeeper Dependency: When executing INSERT and ALTER queries, ReplicatedMergeTree needs ZooKeeper’s distributed coordination to achieve synchronization between replicas. However, ZooKeeper is not required when querying replicas.
  • Table-level Replicas: Replicas are defined at the table level, so each table’s replica configuration can be customized based on actual needs, including number of replicas and their distribution in the cluster.
  • Multi Master Architecture: INSERT and ALTER queries can be executed on any replica, they have the same effect. These operations are distributed to each replica for local execution via ZooKeeper coordination.
  • Block Data Blocks: When writing data with INSERT command, data is split into several Block data blocks based on max_block_size (default 1048576 rows). Block data blocks are the basic unit of data writing, with atomicity and uniqueness.
  • Atomicity: During data writing, data within a Block either all succeeds or all fails.
  • Uniqueness: When writing a Block data block, hash digest is calculated and recorded based on data order, rows, and size within the Block. If a Block to be written has the same hash digest as a previously written Block (same data order, size, and rows), that Block is ignored. This design prevents duplicate Block writes caused by exceptions.

ZK Configuration

<yandex>
  <zookeeper-servers>
    <node index="1">
      <host>h121.wzk.icu</host>
      <port>2181</port>
    </node>
    <node index="2">
      <host>h122.wzk.icu</host>
      <port>2181</port>
    </node>
    <node index="3">
      <host>h123.wzk.icu</host>
      <port>2181</port>
    </node>
  </zookeeper-servers>
</yandex>

Enable ZK

Need to enable in config file:

vim /etc/clickhouse-server/config.xml

# Add the following line where you configured before
<include_from>/etc/clickhouse-server/config.d/metrika.xml</include_from>
# If you didn't have the following line before
<zookeeper incl="zookeeper-servers" optional="true" />

Restart Service

systemctl restart clickhouse-server

Verify

-- Connect to ClickHouse
clickhouse-client -m --host h121.wzk.icu --port 9001 --user default --password clickhouse@wzk.icu

-- Check if successfully connected to ZooKeeper
SELECT * FROM system.zookeeper WHERE path = '/';

Replica Definition

Create New Table

CREATE TABLE replicated_sales_5(
  `id` String,
  `price` Float64,
  `create_time` DateTime
) ENGINE = ReplicatedMergeTree('/clickhouse/tables/01/replicated_sales_5', 'h121.wzk.icu')
PARTITION BY toYYYYMM(create_time)
ORDER BY id;
  • /clickhouse/tables - conventional path
  • /01/ - shard number
  • replicated_sales_5 - table name, recommend same as physical table
  • h121.wzk.icu - replica name in ZK, conventionally server name

ReplicatedMergeTree Principle

Data Structure

[zk: localhost:2181(CONNECTED) 7] ls /clickhouse/tables/01/replicated_sales_5
[alter_partition_version, block_numbers, blocks, columns, leader_election, log, metadata, mutations, nonincrement_block_numbers, part_moves_shard, pinned_part_uuids, quorum, replicas, table_shared_id, temp, zero_copy_hdfs, zero_copy_s3]

Metadata:

  • metadata: Metadata - primary key, sampling expression, partition key
  • columns: Column field data types and names
  • replicas: Replica names

Flags:

  • leader_election: Master replica election path
  • blocks: Hash value (duplicate data insertion), partition_id
  • max_insert_block_size: 1048576 rows
  • block_numbers: Block order within same partition
  • quorum: Number of replicas

Operations:

  • log: log-000000 - regular operations
  • mutations: delete update

Create Table 1

CREATE TABLE a1(
  id String,
  price Float64,
  create_time DateTime
) ENGINE = ReplicatedMergeTree('/clickhouse/tables/01/a1', 'h121.wzk.icu')
PARTITION BY toYYYYMM(create_time)
ORDER BY id;
  • Initialize all ZK nodes based on zk_path
  • Register own replica instance under replicas node
  • Start listening task, listen to LOG node
  • Participate in replica election, elect master replica - first to insert becomes master

Create Table 2

Create second replica instance:

clickhouse-client -m --host h122.wzk.icu --port 9001 --user default --password clickhouse@wzk.icu
CREATE TABLE a1(
  id String,
  price Float64,
  create_time DateTime
) ENGINE = ReplicatedMergeTree('/clickhouse/tables/01/a1', 'h122.wzk.icu')
PARTITION BY toYYYYMM(create_time)
ORDER BY id;

At this point participate in replica election, h121.wzk.icu becomes master replica.

Insert Data 1

insert into table a1 values('A001',100,'2024-08-20 08:00:00');

View Results

After execution, view data on ZK:

ls /clickhouse/tables/01/a1/blocks

Output shows: After insert command executes, local partition directory is written, then write block_id for that partition:

[zk: localhost:2181(CONNECTED) 6] ls /clickhouse/tables/01/a1/blocks
[202408_16261221490105862188_1058020630609096934]

View Logs

Next, h121.wzk.icu replica initiates push of operation log to LOG:

[zk: localhost:2181(CONNECTED) 7] ls /clickhouse/tables/01/a1/log
[log-0000000000]

Insert another piece of data:

insert into table a1 values('A002',200,'2024-08-21 08:00:00');

View LOG:

ls /clickhouse/tables/01/a1/log
get /clickhouse/tables/01/a1/log/log-0000000000
get /clickhouse/tables/01/a1/log/log-0000000001

Output:

[zk: localhost:2181(CONNECTED) 14] ls /clickhouse/tables/01/a1/log
[log-0000000000, log-0000000001]

[zk: localhost:2181(CONNECTED) 13] get /clickhouse/tables/01/a1/log/log-0000000000
format version: 4
create_time: 2024-08-01 17:10:35
source replica: h121.wzk.icu
block_id: 202408_16261221490105862188_1058020630609096934
get
202408_0_0_0
part_type: Compact

[zk: localhost:2181(CONNECTED) 16] get /clickhouse/tables/01/a1/log/log-0000000001
format version: 4
create_time: 2024-08-01 17:16:37
source replica: h121.wzk.icu
block_id: 202408_3260633639629896920_11326802927295833243
get
202408_1_1_0
part_type: Compact

Pull Logs

Next, second replica pulls LOG: h122.wzk.icu node always monitors /log node changes. When h121.wzk.icu pushes /log/log-0000, 0001, h122.wzk.icu triggers log pull task and updates log_pointer.

Error Quick Reference

SymptomRoot CauseLocationFix
Not connected to ZooKeeperZK address/permission/networktelnet zk 2181 / system.zookeeperFix config, firewall, restart
Long-time absolute_delay > 0Slow log pulling/backlogsystem.replication_queueSYSTEM SYNC REPLICA; check disk/CPU
Replica … already existsDuplicate registrationsystem.replicasDROP/DETACH then ATTACH or change {replica}
Too many parts read-onlyToo many small partssystem.partsAdjust batch/merge strategy, reduce write fragmentation
Transaction/DDL inconsistentDDL not executed on all replicassystem.query_logAlways use ON CLUSTER or cluster_all_replicas=1

ClickHouse Sharding × Replica × Distributed Practice