Big Data 133 - ClickHouse Concepts & Basics | Why Fast? Columnar + Vectorized + MergeTree Comparison

TL;DR

Scenario: Want high-concurrency low-latency OLAP, and don’t want to use entire Hadoop/lakehouse. Conclusion: ClickHouse’s core lies in columnar+vectorized+MergeTree+approximate statistics; suitable for ad-hoc analysis and near-real-time writes, not suitable for strong transactions and high-frequency row-level updates.

Brief Overview

ClickHouse is a fast open-source OLAP database management system, it is column-oriented, allows using SQL queries to generate analytical reports in real-time.

Brief Selection

Requirements/ConstraintsSuitableNot Suitable
High-concurrency, low-latency OLAP
Heavy transactions/strong-consistency OLTP
Near-real-time writes, ad-hoc analysis
Frequent row-level updates/deletes⚠️

OLTP vs OLAP

OLTP

OLTP: Online Transaction Processing

  • Application scenarios: ERP, CRM, process approval, data entry, etc.
  • Characteristics: Offline work goes online, data saved in respective systems (data silos)

OLAP

OLAP: On-Line Analytical Processing

  • Application scenarios: Analysis reports, analysis decisions, etc.
  • Solution 1: Data warehouse (Hadoop HDFS architecture) - complex architecture, low timeliness
  • Solution 2: ClickHouse, Kudu

ClickHouse Development History

Yandex company went public in 2011, ClickHouse open-sourced on June 15, 2016. ClickHouse was initially used for Yandex.Metrica (online traffic analysis product).

Technology evolution:

  • ROLAP: Traditional relational database OLAP, based on MySQL’s MyISAM table engine
  • MOLAP: Implemented data cube through materialized views
  • HOLAP: Combination of R and M
  • ClickHouse: ROLAP

ClickHouse Core Features

Extremely High Query Performance

  • Columnar Storage: Only reads columns needed for query, reducing disk I/O
  • Vectorized Computation: Batch processes data, improving CPU efficiency
  • Data Compression: Efficient compression algorithms, reducing storage costs

Horizontal Scalability

  • Distributed Architecture: Supports cluster deployment, easily handles PB-level data
  • Linear Scaling: Improves performance by adding nodes without downtime

Real-time Data Writing

  • High Throughput: Can insert millions of rows per second
  • Low Latency: Data immediately queryable after writing, meeting real-time analysis needs

Rich Feature Support

  • Various data types
  • Advanced SQL features: window functions, subqueries, JOINs, etc.
  • Materialized views: Pre-compute and store query results

Typical Application Scenarios

  • User behavior analysis
  • Log and monitoring data storage
  • Business Intelligence (BI)

ClickHouse Supported Features

  • True column-oriented DBMS
  • Efficient data compression
  • Data stored on disk
  • Multi-core parallel processing
  • Distributed processing across multiple servers
  • SQL syntax support
  • Vectorized engine
  • Real-time data updates
  • Indexes
  • Suitable for online queries
  • Supports approximate estimation calculations
  • Supports nested data structures
  • Supports arrays as data types
  • Supports limiting query complexity and quotas
  • Supports data replication and data integrity

ClickHouse Comparison

Commercial OLAP

  • HP Vertica, Actian the Vector
  • Difference: ClickHouse is open source and free

Cloud Solutions

  • Amazon RedShift, Google BigQuery
  • Difference: ClickHouse can be deployed on own machines, no cloud payment needed

Hadoop Ecosystem

  • Cloudera Impala, Spark SQL, Facebook Presto, Apache Drill
  • Difference: ClickHouse supports real-time high-concurrency systems, doesn’t depend on Hadoop ecosystem

Open Source OLAP Databases

  • InfiniDB, MonetDB, LucidDB
  • Difference: Small application scale, not tried in large-scale internet services

Non-relational Databases

  • Druid, Apache Kylin
  • Difference: ClickHouse can query directly from raw data

True Column-Oriented DBMS

To make queries faster:

  1. Reduce data scan range
  2. Reduce data transfer size during transmission

ClickHouse advantages:

  • Supports fixed-length values, 1 billion Int8 values consume about 1GB uncompressed disk space
  • Not just a single database, but a complete DBMS, supports DDL, DML, permission management, data backup, distributed storage, etc.

Data Compression

Data compression can improve performance

Disk Storage

Even across thousands of servers, memory is too small to store all browsing and sessions

Multi-core Parallel

Multi-core parallel processing for large queries

Distributed Processing Across Servers

Data can reside on different shards, each shard can be a set of replicas for fault tolerance

SQL Support

  • Supported queries include GROUP BY, ORDER BY
  • Subqueries supported in FROM, IN, JOIN clauses
  • Scalar subquery support
  • Correlated subqueries not supported

Vectorized Engine

Data is not only stored by columns, but processed by vector-column parts. Vectorized execution uses CPU’s SIMD instructions (Single Instruction Multiple Data), i.e., using single instruction to process multiple data.

Real-time Data Updates

ClickHouse supports primary key tables, data is incrementally sorted using MergeTree, data can be continuously added to table, lock-free processing when adding data.

Indexes

With primary key can extract data for specific clients within specific time range, latency less than tens of milliseconds.

Supports Online Queries

Can process queries in real-time without latency

Supports Approximate Calculations

  • System includes aggregate functions for approximate calculation of various values, medians and quantiles
  • Supports running queries on partial (sample) data to get approximate results
  • Supports running aggregations for limited number of random keys

Data Replication and Integrity Support

Uses asynchronous multi-master replication, after writing to any available replica, data is distributed to all remaining replicas.

Notes

ClickHouse is not perfect:

  • Does not support transactions
  • Although supports conditional Delete/Update (mutations), just non-transactional, asynchronous, high overhead for rewriting shard data
  • Supports limited operating systems

Summary

The emergence of ClickHouse open source has refreshed many companies wanting to do big data analysis. ClickHouse precisely features not depending on Hadoop ecosystem, simple installation and maintenance, fast query speed, SQL support, etc., going further and further in the big data field.