Big Data 133 - ClickHouse Concepts & Basics | Why Fast? Columnar + Vectorized + MergeTree Comparison
TL;DR
Scenario: Want high-concurrency low-latency OLAP, and don’t want to use entire Hadoop/lakehouse. Conclusion: ClickHouse’s core lies in columnar+vectorized+MergeTree+approximate statistics; suitable for ad-hoc analysis and near-real-time writes, not suitable for strong transactions and high-frequency row-level updates.
Brief Overview
ClickHouse is a fast open-source OLAP database management system, it is column-oriented, allows using SQL queries to generate analytical reports in real-time.
Brief Selection
| Requirements/Constraints | Suitable | Not Suitable |
|---|---|---|
| High-concurrency, low-latency OLAP | ✅ | |
| Heavy transactions/strong-consistency OLTP | ❌ | |
| Near-real-time writes, ad-hoc analysis | ✅ | |
| Frequent row-level updates/deletes | ⚠️ |
OLTP vs OLAP
OLTP
OLTP: Online Transaction Processing
- Application scenarios: ERP, CRM, process approval, data entry, etc.
- Characteristics: Offline work goes online, data saved in respective systems (data silos)
OLAP
OLAP: On-Line Analytical Processing
- Application scenarios: Analysis reports, analysis decisions, etc.
- Solution 1: Data warehouse (Hadoop HDFS architecture) - complex architecture, low timeliness
- Solution 2: ClickHouse, Kudu
ClickHouse Development History
Yandex company went public in 2011, ClickHouse open-sourced on June 15, 2016. ClickHouse was initially used for Yandex.Metrica (online traffic analysis product).
Technology evolution:
- ROLAP: Traditional relational database OLAP, based on MySQL’s MyISAM table engine
- MOLAP: Implemented data cube through materialized views
- HOLAP: Combination of R and M
- ClickHouse: ROLAP
ClickHouse Core Features
Extremely High Query Performance
- Columnar Storage: Only reads columns needed for query, reducing disk I/O
- Vectorized Computation: Batch processes data, improving CPU efficiency
- Data Compression: Efficient compression algorithms, reducing storage costs
Horizontal Scalability
- Distributed Architecture: Supports cluster deployment, easily handles PB-level data
- Linear Scaling: Improves performance by adding nodes without downtime
Real-time Data Writing
- High Throughput: Can insert millions of rows per second
- Low Latency: Data immediately queryable after writing, meeting real-time analysis needs
Rich Feature Support
- Various data types
- Advanced SQL features: window functions, subqueries, JOINs, etc.
- Materialized views: Pre-compute and store query results
Typical Application Scenarios
- User behavior analysis
- Log and monitoring data storage
- Business Intelligence (BI)
ClickHouse Supported Features
- True column-oriented DBMS
- Efficient data compression
- Data stored on disk
- Multi-core parallel processing
- Distributed processing across multiple servers
- SQL syntax support
- Vectorized engine
- Real-time data updates
- Indexes
- Suitable for online queries
- Supports approximate estimation calculations
- Supports nested data structures
- Supports arrays as data types
- Supports limiting query complexity and quotas
- Supports data replication and data integrity
ClickHouse Comparison
Commercial OLAP
- HP Vertica, Actian the Vector
- Difference: ClickHouse is open source and free
Cloud Solutions
- Amazon RedShift, Google BigQuery
- Difference: ClickHouse can be deployed on own machines, no cloud payment needed
Hadoop Ecosystem
- Cloudera Impala, Spark SQL, Facebook Presto, Apache Drill
- Difference: ClickHouse supports real-time high-concurrency systems, doesn’t depend on Hadoop ecosystem
Open Source OLAP Databases
- InfiniDB, MonetDB, LucidDB
- Difference: Small application scale, not tried in large-scale internet services
Non-relational Databases
- Druid, Apache Kylin
- Difference: ClickHouse can query directly from raw data
True Column-Oriented DBMS
To make queries faster:
- Reduce data scan range
- Reduce data transfer size during transmission
ClickHouse advantages:
- Supports fixed-length values, 1 billion Int8 values consume about 1GB uncompressed disk space
- Not just a single database, but a complete DBMS, supports DDL, DML, permission management, data backup, distributed storage, etc.
Data Compression
Data compression can improve performance
Disk Storage
Even across thousands of servers, memory is too small to store all browsing and sessions
Multi-core Parallel
Multi-core parallel processing for large queries
Distributed Processing Across Servers
Data can reside on different shards, each shard can be a set of replicas for fault tolerance
SQL Support
- Supported queries include GROUP BY, ORDER BY
- Subqueries supported in FROM, IN, JOIN clauses
- Scalar subquery support
- Correlated subqueries not supported
Vectorized Engine
Data is not only stored by columns, but processed by vector-column parts. Vectorized execution uses CPU’s SIMD instructions (Single Instruction Multiple Data), i.e., using single instruction to process multiple data.
Real-time Data Updates
ClickHouse supports primary key tables, data is incrementally sorted using MergeTree, data can be continuously added to table, lock-free processing when adding data.
Indexes
With primary key can extract data for specific clients within specific time range, latency less than tens of milliseconds.
Supports Online Queries
Can process queries in real-time without latency
Supports Approximate Calculations
- System includes aggregate functions for approximate calculation of various values, medians and quantiles
- Supports running queries on partial (sample) data to get approximate results
- Supports running aggregations for limited number of random keys
Data Replication and Integrity Support
Uses asynchronous multi-master replication, after writing to any available replica, data is distributed to all remaining replicas.
Notes
ClickHouse is not perfect:
- Does not support transactions
- Although supports conditional Delete/Update (mutations), just non-transactional, asynchronous, high overhead for rewriting shard data
- Supports limited operating systems
Summary
The emergence of ClickHouse open source has refreshed many companies wanting to do big data analysis. ClickHouse precisely features not depending on Hadoop ecosystem, simple installation and maintenance, fast query speed, SQL support, etc., going further and further in the big data field.