TL;DR

  • Scenario: Massive data (100 TB, billion-row tables), high-concurrency SQL analysis, requiring second-level response and visualization access.
  • Conclusion: Kylin uses pre-computed Cubes/Cuboids plus HBase storage to achieve MOLAP acceleration; real-time capability depends on the model and the build pipeline.
  • Output: a version/component selection matrix plus a quick-reference card for common faults, to help with engineering implementation and troubleshooting.

Version Matrix

| Module/Capability | Verified Description |
| --- | --- |
| Data source | Batch: Hive; stream: Kafka (introduced in V1.5; V2.4 adds JOINs with Hive dimension tables) |
| Compute/build engine | MapReduce early on; Spark supported later; docs also mention Flink support (verify against your actual deployment) |
| Storage engine | HBase persists the Cube; performance depends on RowKey design and column-family layout optimization |
| Model | Star/snowflake schema (V2.0); dimension pruning and Cuboid selection affect storage size and query hit rate |
| Query interface | SQL (parsed by Calcite); JDBC/ODBC/REST; integrates with BI tools (Zeppelin, Tableau, etc.) |
| Real-time/near-real-time | NRT introduced in V1.6; V3.0 labeled Real-time OLAP; latency depends on the pipeline and aggregation strategy |
| Build strategy | Full builds plus time-based incremental builds; supports merge and refresh |
| Routing/hit | The query engine routes the logical plan to a matching Cuboid; on a miss it may fall back and performance degrades |

Background History

Apache Kylin is an open-source distributed analysis engine built specifically for multi-dimensional online analytical processing (MOLAP) over massive datasets. It was initially developed by eBay's China R&D team in 2013 to address eBay's growing need to analyze massive amounts of data.

In October 2014, eBay officially donated the Kylin project to the Apache Software Foundation, making it an Apache Incubator project. After just over a year of incubation, Kylin graduated in November 2015 and became an Apache Top-Level Project (TLP). This milestone made Apache Kylin the first open-source big-data project led by Chinese developers to become an Apache TLP.

As the technology matured, the core Apache Kylin development team founded Kyligence (short for Kylin Intelligence) in 2016. The startup raised tens of millions in funding from Redpoint Ventures, Broadband Capital, and other well-known investors, focusing on commercial, enterprise-grade Kylin solutions and services.

It is worth mentioning that the name Apache Kylin comes from the Chinese mythological creature "Qilin" (a unicorn-like beast), which symbolizes auspiciousness and wisdom. The naming reflects the developers' connection to traditional Chinese culture and represents the project's vision for big-data analytics.


Development History

  • 2014: Kylin is born, supporting Hive as a batch data source to extract value from massive historical data
  • 2015: V1.5 first supports Kafka as a data source, building with single-machine micro-batch processing
  • 2016: V1.6 releases near-real-time streaming (NRT Streaming), consuming streaming data with Hadoop micro-batches
  • 2017: V2.0 supports the snowflake model and the Spark engine; the eBay team starts experimenting with real-time
  • 2018: V2.4 supports JOINs between Kafka streaming data and Hive dimension tables; eBay open-sources its real-time OLAP implementation
  • 2019 Q1: after community review and refinement, the real-time code is merged into master
  • 2019 Q4: V3.0 releases Real-time OLAP, achieving second-level data-preparation latency

Kylin provides multi-dimensional (MOLAP) analysis with second-level response and is currently used by many companies in China.


Project Features

  • Data sources and models: mainly supports Hive and Kafka
  • Build engine: early versions support the MapReduce compute engine; newer versions support the Spark and Flink engines. Besides full builds, incremental builds based on time partitions are supported.
  • Storage engine: built Cubes are stored in HBase as key-value pairs, and queries are accelerated through RowKey optimization. The calculation result for each dimension combination is saved as a materialized view called a Cuboid.
  • Optimization algorithms: the Cube trades space for time, but also applies algorithmic pruning to strike a balance between the two
  • Access interfaces: supports standard SQL queries and integrates with Zeppelin, Tableau, and other BI tools; SQL is routed to the corresponding Cuboid by the query engine.

Application Scenarios

Characteristics: Kylin returns query results over massive data at sub-second latency.

Typical Kylin application scenarios:

  • Huge data volume: single source tables at the billion-row level, single sources reaching 100 TB
  • Huge query pressure (high-concurrency queries)
  • Fast query response required
  • Downstream consumers need relatively flexible querying and must be able to run SQL with complex conditions

Kylin’s core idea is pre-computation: it pre-calculates all possible query results for the specified dimensions and metrics, trading space for time to accelerate OLAP queries with fixed patterns.
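The space-for-time idea can be sketched in a few lines. This is an illustrative toy only, not Kylin's implementation: the fact rows, column names, and `total_price` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical fact rows: (part_dt, site_id, price)
rows = [
    ("2020-10-01", 0, 100.0),
    ("2020-10-01", 0, 50.0),
    ("2020-10-01", 1, 80.0),
    ("2020-10-02", 0, 20.0),
]

# Build step (offline, done once): materialize SUM(price)
# for every (part_dt, site_id) combination.
precomputed = defaultdict(float)
for part_dt, site_id, price in rows:
    precomputed[(part_dt, site_id)] += price

# Query step (online): a constant-time lookup, no scan of `rows`.
def total_price(part_dt, site_id):
    return precomputed.get((part_dt, site_id), 0.0)

print(total_price("2020-10-01", 0))  # 150.0
```

The build step pays storage and build time up front so that every subsequent query with the same pattern is a lookup rather than a scan, which is exactly the trade-off the paragraph above describes.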


Basic Terms

Data Warehouse

A data warehouse is a methodology for storing information-system data. It emphasizes organizing data in ways that make it especially amenable to analysis and processing, producing valuable information that supports decision-making.

Once stored in a data warehouse, data does not change over time. In addition, stored data must carry a time attribute. A data warehouse contains large amounts of historical data, and specific analysis methods can be applied to it to surface useful information.

OLTP

OLTP (Online Transaction Processing) is the traditional relational-database workload.

OLAP Classification

OLAP (Online Analytical Processing) analyzes data along multiple dimensions and presents integrated information for decision-making; it is mostly used in data warehouses and business intelligence. Its main purpose is to facilitate large-scale data analysis and statistical computation, providing reference and support for decisions. It is distinguished from OLTP (Online Transaction Processing), which focuses on basic, day-to-day transaction processing, mainly CRUD operations within transactions.

In practice, the OLAP concept is understood in two ways. In the broad sense, it covers all analytical processing that does not update data; more often, OLAP is understood narrowly as multi-dimensional analysis based on Cube computation.

OLAP has multiple implementations, classified by how the data is stored:

  • ROLAP (Relational OLAP): detail and aggregated data are kept in relational databases; Hive and Spark SQL are ROLAP.
  • MOLAP (Multidimensional OLAP): aggregates are pre-computed and stored in a dedicated multi-dimensional database, so OLAP operations map directly to multi-dimensional database accesses. Its essence is trading space for time; Kylin is essentially MOLAP.
  • HOLAP (Hybrid OLAP): a hybrid data organization with a relational lower layer and a multi-dimensional upper layer, offering better flexibility.

Fact Table and Dimension Table

A fact table stores factual records such as system logs, sales records, or sensor readings. Its records grow continuously, and its volume is usually much larger than that of dimension tables.

A dimension table (or lookup table) accompanies a fact table and stores dimension attribute values that can be joined with the fact table. It is equivalent to extracting frequently repeated attributes from the fact table and normalizing them into a separate table for easier management.

Common dimension tables include a date table (mapping each date to its year, month, quarter, etc.) and a region table (country, province, city, etc.). Dimension tables usually change infrequently.

Dimension tables bring benefits:

  • Reduces the size of the fact table
  • Simplifies dimension management and maintenance: adding, deleting, or modifying dimension attributes does not require touching the massive number of records in the fact table
  • A dimension table can be reused by multiple fact tables
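The benefits above can be sketched with two toy tables. All names here (`region_dim`, `sales_fact`, `sales_by_country`) are hypothetical, chosen only to illustrate the fact/dimension split:

```python
# Dimension table: region_id -> descriptive attributes.
# Renaming a city touches this one row, not millions of fact rows.
region_dim = {
    1: {"country": "US", "city": "San Jose"},
    2: {"country": "CN", "city": "Shanghai"},
}

# Fact table rows: (order_id, region_id, amount).
# Each row stores only a compact foreign key into the dimension table.
sales_fact = [
    (1001, 1, 299.0),
    (1002, 2, 88.0),
    (1003, 1, 10.0),
]

# A query joins fact rows to dimension attributes via the foreign key.
def sales_by_country(country):
    return sum(amount for _, region_id, amount in sales_fact
               if region_dim[region_id]["country"] == country)

print(sales_by_country("US"))  # 309.0
```

Because the fact table carries only `region_id`, the same `region_dim` table could also serve other fact tables (returns, inventory) without duplication.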

Dimensions and Metrics

Take this sample order record as an example:

userid, 2020-10-01 09:00:00, productid, shopid, orderid, 299

A dimension is a perspective from which to examine data, usually an attribute of a data record, such as time or location.

A metric is an aggregated statistical value, i.e., the result of an aggregation operation, usually numeric, such as total sales or the number of distinct users.

Analysts often combine multiple dimensions when examining metric values to find patterns of change. In a SQL query, the GROUP BY attributes are usually the dimensions, and the computed values are the metrics.

SELECT
  part_dt,
  lstg_site_id,
  sum(price) as total_selled,
  count(DISTINCT seller_id) as sellers
FROM
  kylin_sales
GROUP BY part_dt, lstg_site_id;

In the query above, part_dt and lstg_site_id are dimensions, while sum(price) and count(DISTINCT seller_id) are metrics.

Star & Snowflake Schema

The star schema is a data model commonly used in data-warehouse dimensional modeling. It has one fact table surrounded by one or more dimension tables; the fact table and dimension tables are related through primary/foreign keys, while the dimension tables are not related to each other. They look like small stars around a central star, hence the name.

Another common model is the snowflake schema, which extracts some dimension tables from the star schema into more fine-grained dimension tables and lets the dimension tables relate to one another. The shape resembles a snowflake, hence the name.

Cube and Cuboid

A Cube (multi-dimensional cube) is also called a data cube.

A cube composed of three dimensions (real models may have more than three; three are used here only for illustration) forms an OLAP cube. The cube contains cells (sub-cubes) whose values satisfy the given conditions; these cells hold the data to be analyzed, known as metric values.

  • Cube (or DataCube): the multi-dimensional space constructed from the dimensions; it contains all of the basic data to analyze, and all aggregation operations are performed on the cube. Building a multi-dimensional index over the original data in this way greatly accelerates queries; "data cube" is just an illustrative name for the multi-dimensional model
  • Dimension: a perspective for observing data, generally a set of discrete values
  • Metric: the result of an aggregation calculation, generally a continuous value
  • Cuboid: in Kylin, specifically the data computed for one combination of dimensions
  • Each field in the fact table is either a dimension or a metric
  • Given a data model, all of its dimensions can be combined; for N dimensions there are 2^N possible combinations

Technical Architecture

Apache Kylin system can be divided into:

  • Online query
  • Offline build

In the architecture diagram, online query sits mainly in the upper half and offline build in the lower half.

Kylin technical architecture:

  • The data source is mainly Hadoop Hive; data is stored as relational tables and contains the data to be analyzed. Based on the metadata definitions, the build engine extracts data from the source and builds the Cube
  • Kylin can use MapReduce or Spark as the build engine; the built Cube is saved in the storage engine, generally HBase
  • After the offline build completes, users can send SQL to the query system for analysis
  • Kylin provides REST API, JDBC, and ODBC interfaces. Whichever interface is used, the SQL ultimately arrives at the REST service layer and is then forwarded to the query engine for processing.
  • SQL is written against the data source's relational model, not the Cube. Kylin deliberately hides the Cube concept from query users: they only need to understand the relational model to use Kylin, with no extra learning curve, so traditional SQL applications migrate easily.
  • The query engine parses the SQL, generates a logical execution plan over the relational tables, converts it into a physical plan over the Cube, and finally queries the pre-built Cube to produce results; the whole process never touches the original data source

Component Functions

  • REST Server: provides the RESTful interface for operations such as creating, building, refreshing, and merging Cubes; managing metadata for Kylin projects, tables, and so on; user access control; and SQL queries
  • Query Engine: uses the open-source Apache Calcite framework for SQL parsing; can be understood as the SQL engine layer
  • Routing: responsible for converting the execution plan produced by SQL parsing into queries against the cached Cube; these queries can complete at second or even millisecond level
  • Metadata: Kylin maintains extensive metadata, including Cube definitions, star-schema definitions, jobs and their outputs, and model dimension information. Kylin's metadata and Cubes are both stored in HBase, serialized as JSON strings
  • Cube Build Engine: the foundation of all the other modules, responsible for Kylin's pre-computation, i.e., creating Cubes. The process first reads raw data through Hive, then generates HTables via MapReduce or Spark computation, and finally loads the data into HBase tables.

Working Principle

Apache Kylin works by performing Cube computation over the data model and using the computed results to accelerate queries. The process:

  • Specify the data model and define its dimensions and metrics
  • Pre-compute the Cube: calculate all Cuboids and save them as materialized views (stored in HBase)
  • At query time, read the relevant Cuboids, compute over them, and produce the query results
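The query-time step implies a routing decision: any Cuboid whose dimension set is a superset of the query's GROUP BY dimensions can answer the query, and a smaller Cuboid means less data to scan. The following is a hedged sketch of that idea, not Kylin's actual router; the Cuboid sizes and `route` function are hypothetical:

```python
# Hypothetical pre-built cuboids: frozen dimension set -> row count.
cuboids = {
    frozenset(): 1,
    frozenset({"dt"}): 365,
    frozenset({"dt", "site"}): 3_650,
    frozenset({"dt", "site", "seller"}): 500_000,
}

def route(query_dims):
    """Pick the smallest cuboid that covers the queried dimensions."""
    candidates = [d for d in cuboids if frozenset(query_dims) <= d]
    if not candidates:
        return None  # miss: fall back to detail data, performance degrades
    return min(candidates, key=lambda d: cuboids[d])  # cheapest hit

print(sorted(route({"dt"})))  # ['dt']
```

A query grouping only on `site` has no exact Cuboid here, so it would be served by the (dt, site) Cuboid with an extra aggregation over `dt`, which is why Cuboid selection affects hit rates as the version matrix notes.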

Efficient OLAP analysis:

  • Kylin's query path does not scan the original records; expensive operations such as table joins and aggregations are completed during pre-computation
  • Executing queries against pre-computed results is generally 1-2 orders of magnitude faster than technologies without pre-computation, and the advantage grows on very large datasets
  • When the dataset reaches the billion or trillion scale, Kylin can be 1000+ times faster than non-pre-computed technologies

Kylin Ecosystem

  • Core: Kylin's OLAP engine consists of the metadata engine, query engine, job engine, and storage engine, plus a REST service that serves external query requests
  • Extensibility: a plugin mechanism supports additional features and functions
  • Integration: integrates with job schedulers, ETL tools, and monitoring/alerting systems
  • Drivers: provides ODBC and JDBC drivers for integration with other tools (such as Tableau)

Error Quick Reference

| Symptom | Root Cause | How to Locate | Fix |
| --- | --- | --- | --- |
| Build job stuck or failing for a long time | Skewed source table, huge shuffle, oversized dictionary encoding | Check build job logs and stage metrics; check dimension cardinality and skewed keys | Adjust partitioning and parallelism; reduce or drop high-cardinality dimensions; enable skew handling and pre-aggregation |
| Queries occasionally slow or time out | Cuboid miss or poor RowKey design | Enable query profiling; check which Cuboid was hit and the scan range | Add Cuboids for common dimension combinations; restructure the RowKey prefix and column families; create replicas for hot queries |
| Near-real-time latency grows | Kafka partition/consumption lag; micro-batch interval and resources insufficient | Monitor consumer lag, the build queue, and the HBase write rate | Increase parallelism and executor resources; shorten the micro-batch interval; optimize HBase writes (write buffer, region splitting) |
| Results inconsistent with detail-table statistics | Inconsistent dimension dictionaries or time partitions; differing DISTINCT semantics | Sample-verify partitions; check DISTINCT/aggregation semantics | Unify semantics and time zones; rebuild the affected partitions; model DISTINCT metrics separately |
| Builds frequently OOM or hit RegionTooBusy | Too many Cuboids; a single Region overheats | Check Region hot spots and memory statistics | Prune Cuboids; pre-split and balance Regions; add RegionServer resources |
| "Table not found" / "permission denied" | Metadata not synced; wrong permissions or connection string | Verify the MetaStore, Kylin metadata, and credentials | Re-sync metadata; fix the data source/driver config; grant the minimum required permissions |
| Query falls back to detail data and slows down | Filter dimension not in the index prefix; missing key combination | Analyze query-plan hit rates | Put common filter dimensions in the prefix; add the missing combined Cuboid; reorder filter conditions |
| Abnormal results after incremental merge | Improper partition boundary/watermark management | Audit the incremental interval and merge logs | Unify the partition watermark; verify on small partitions before a full rollout; roll back and rebuild fully if needed |