TL;DR
- Scenario: pre-compute aggregation queries on an e-commerce sales fact table in Kylin, accelerated along the "date" dimension.
- Conclusion: reasonable Cuboid pruning plus dimension/fact table optimization strikes a balance between storage and performance; the build and query pipeline is verified end to end.
- Output: a full walkthrough of Project/DataSource/Model/Cube, a monitoring and error-triage table, and practical SQL examples.
Technique Matrix
| Module | Technical Points | Engineering Suggestions |
|---|---|---|
| Data Source Prep | Hive table design, partition field selection | Keep dimension table primary keys unique and single tables ≤300MB; trim fact table fields; avoid Hive views |
| Dimension Modeling | Star/snowflake model selection, dimension hierarchy definition | Reuse dimension table snapshots; separate the time dimension; keep fact table foreign key mappings simple to avoid complex JOINs |
| Cube Design | Dimension/measure config, Aggregation Group division | Put high-frequency dimensions in a separate group; prune low-cardinality dimensions first |
| Pre-compute Optimization | Cuboid pruning strategies (Mandatory/Hierarchy/Joint) | Use derived dimensions to reduce the Cuboid count; prefer the Spark build engine over MapReduce |
| Storage & Compression | HBase column family design, Snappy/LZO compression selection | Separate cold and hot data; design RowKeys to avoid Region hotspots |
| Incremental Build | Auto-merge strategy, segment build time range control | Partition by hour/day; retain only the minimum necessary historical segments |
| Real-time OLAP (Kylin 4.0) | Kafka stream intake, micro-batch processing window config | Align message partitions with dimension tables; enable Exactly-Once semantics |
| Monitoring & Tuning | Cube build time metrics, query hit rate analysis | Monitor Cuboid utilization with Grafana; optimize queries exceeding 50ms |
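The last row mentions Cuboid utilization and a 50 ms query budget. A minimal sketch of that kind of log post-processing — the log structure and field names here are hypothetical, not a Kylin API:

```python
def cuboid_utilization(query_log, total_cuboids):
    """Fraction of distinct cuboids that served at least one query."""
    hit = {entry["cuboid_id"] for entry in query_log}
    return len(hit) / total_cuboids

def slow_queries(query_log, budget_ms=50):
    """Queries exceeding the latency budget, worst first."""
    return sorted((q for q in query_log if q["latency_ms"] > budget_ms),
                  key=lambda q: q["latency_ms"], reverse=True)

log = [
    {"sql": "q1", "cuboid_id": 3, "latency_ms": 12},
    {"sql": "q2", "cuboid_id": 3, "latency_ms": 75},
    {"sql": "q3", "cuboid_id": 7, "latency_ms": 48},
]
print(cuboid_utilization(log, total_cuboids=8))   # 2 of 8 cuboids used -> 0.25
print([q["sql"] for q in slow_queries(log)])      # ['q2']
```

In practice these numbers would come from Kylin's query metrics and be plotted in Grafana; the point is the two ratios to watch, not the plumbing.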
SQL Example Code Blocks
```sql
-- Dimension table optimization example
CREATE TABLE dim_product (
  product_id    STRING COMMENT 'primary key; values must be unique (Hive does not enforce this)',
  category_path STRING COMMENT 'hierarchy dimension',
  price_bucket  INT    COMMENT 'derived dimension'
) STORED AS ORC;

-- Fact table optimization example
-- Note: the partition column dt is declared in PARTITIONED BY,
-- not in the regular column list.
CREATE TABLE fact_sales (
  order_id   STRING,
  product_id STRING COMMENT 'foreign key mapping to dim_product',
  amount     DECIMAL(18,2)
) PARTITIONED BY (dt DATE) STORED AS PARQUET;
```
Formulas and Parameters
Key parameter calculation formulas:
- Cuboid count estimation: $N = \prod_{i=1}^{n} (L_i + 1)$, where $L_i$ is the number of levels in hierarchy dimension $i$
- Storage optimization threshold: $S_{optimal} = \dfrac{T_{base} \times C_{ratio}}{R_{comp}}$
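Assuming the subscripts denote hierarchy level counts ($L_i$), base table size ($T_{base}$) and compression-related ratios ($C_{ratio}$, $R_{comp}$), the two formulas can be evaluated directly; the example values are hypothetical:

```python
from math import prod

def cuboid_count(levels):
    """N = product of (L_i + 1) over the n hierarchy dimensions."""
    return prod(l + 1 for l in levels)

def storage_threshold(t_base, c_ratio, r_comp):
    """S_optimal = (T_base * C_ratio) / R_comp."""
    return t_base * c_ratio / r_comp

# Three hierarchy dimensions with 2, 3 and 1 levels respectively
print(cuboid_count([2, 3, 1]))             # 3 * 4 * 2 = 24 cuboids
print(storage_threshold(120.0, 0.5, 3.0))  # 20.0
```

Such back-of-the-envelope numbers are useful before a build: if the estimated Cuboid count explodes, pruning (see Aggregation Groups below) is needed before committing cluster resources.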
Cube Introduction
Apache Kylin is an open-source distributed analytics engine focused on providing real-time OLAP (Online Analytical Processing) capabilities over big data. Originally developed by eBay and contributed to the Apache Software Foundation, it has become an important OLAP solution in the Hadoop ecosystem. Kylin integrates with various big data components including HBase, Spark and Kafka, and is especially suited to processing petabyte-scale data.
Cube is one of Apache Kylin's core concepts: it pre-computes multi-dimensional datasets over large-scale data to accelerate complex SQL queries. Cube design borrows from the star and snowflake models of data warehousing, but adds a unique pre-computation mechanism. Specifically, a Cube pre-computes and stores all possible dimension combinations (called Cuboids), so queries can use the pre-computed results directly instead of scanning the original data in real time.
Key points of Cube are detailed below:
- Multi-dimensional Modeling: a Cube is built on fact and dimension tables and supports star and snowflake models. For e-commerce analysis, the fact table can be sales records, with dimension tables for time, product, region, etc.
- Pre-computation Mechanism: Kylin pre-computes the various dimension combinations (Cuboids) during the Cube build. For example, a Cube with 5 dimensions theoretically yields 2^5 = 32 Cuboids, covering all dimension combinations.
- Storage Optimization: Kylin uses HBase as its storage engine, stores pre-computed results in columnar format, and applies efficient encoding and compression. For example, Snappy or LZO compression can be chosen to reduce storage space.
- Incremental Build: incremental Cube builds process only new data without rebuilding the entire Cube. This is especially useful when data arrives daily, such as daily sales report analysis.
- Smart Pruning: techniques like Aggregation Group allow computing only the important Cuboids, balancing query performance against storage cost. For example, certain dimension combinations can be marked mandatory while others are computed on demand.
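The 2^5 = 32 figure and the effect of a mandatory dimension can be reproduced with a short enumeration (dimension names are illustrative):

```python
from itertools import combinations

DIMS = ["date", "product", "region", "channel", "user_level"]

def cuboids(dims, mandatory=()):
    """All dimension combinations; mandatory dims must appear in every cuboid."""
    result = []
    for r in range(len(dims) + 1):
        for combo in combinations(dims, r):
            if all(m in combo for m in mandatory):
                result.append(combo)
    return result

print(len(cuboids(DIMS)))                       # 2^5 = 32
print(len(cuboids(DIMS, mandatory=("date",))))  # halved to 16
```

Each mandatory dimension halves the Cuboid count, which is why marking always-filtered dimensions (like the date partition) as mandatory is usually the first pruning step.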
Cube Basic Concepts
Cube in Kylin is a pre-computed data structure generated by multi-dimensional modeling over fact tables (usually business data tables). A Cube covers combinations of dimensions and measures over multi-dimensional data, so that pre-computed results can be retrieved at query time, significantly reducing computation overhead.
- Dimension: Fields in data used for grouping, filtering and slicing, such as time, region, product, etc.
- Measure: Fields that usually need aggregation calculation, such as sales amount, order count, etc.
- Cuboid: each Cube consists of multiple Cuboids; a Cuboid holds the pre-aggregated results for one specific combination of dimensions. Every dimension combination in the Cube generates its own Cuboid.
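A Cuboid is essentially a stored GROUP BY result. A toy sketch with an in-memory fact table (schema and values hypothetical):

```python
from collections import defaultdict

# Tiny fact table: (date, product, amount)
FACTS = [
    ("2024-01-01", "p1", 10.0),
    ("2024-01-01", "p2", 5.0),
    ("2024-01-02", "p1", 7.0),
]

def build_cuboid(facts, dims):
    """Pre-aggregate SUM(amount) grouped by the given dimension subset."""
    idx = {"date": 0, "product": 1}
    agg = defaultdict(float)
    for row in facts:
        key = tuple(row[idx[d]] for d in dims)
        agg[key] += row[2]
    return dict(agg)

# The ("date",) cuboid answers GROUP BY date queries from stored results
print(build_cuboid(FACTS, ("date",)))
# {('2024-01-01',): 15.0, ('2024-01-02',): 7.0}
```

At query time Kylin looks up such stored results instead of re-running the aggregation over the raw fact table.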
Cube Creation Process
- Data Modeling: first create a Data Model in Kylin; this model defines the relationships between fact and dimension tables, similar to a star or snowflake schema, and also defines the measure fields to aggregate.
- Cube Design: design the Cube based on the Data Model, specifying its dimensions and measures. Kylin automatically derives all possible dimension combinations (Cuboids) from these definitions.
- Build Cube: the build process reads the underlying data sources (such as Hive, HBase or Kafka), then generates pre-computed data for each Cuboid based on the specified dimensions and measures. The pre-computed results are stored in HBase or other storage engines.
Cube Query and Optimization
- Query Acceleration: when a SQL query arrives, Kylin selects the appropriate Cuboid based on the dimension combination involved in the query, avoiding real-time computation and greatly improving query performance.
- Cube Optimization: to control Cube size and speed up builds, Kylin supports pruning the Cube so that only some Cuboids are generated; this is configured through "Aggregation Groups" and reduces redundant computation.
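Cuboid selection can be summarized as "smallest pre-computed cuboid whose dimension set covers the query". This is a simplified model of the routing idea, not Kylin's actual planner:

```python
def pick_cuboid(query_dims, available):
    """Smallest pre-computed cuboid that covers all queried dimensions."""
    q = set(query_dims)
    candidates = [c for c in available if q <= set(c)]
    return min(candidates, key=len) if candidates else None

AVAILABLE = [("date",), ("date", "product"), ("date", "product", "region")]

# GROUP BY date, product hits the exact 2-dimension cuboid
print(pick_cuboid(["date", "product"], AVAILABLE))  # ('date', 'product')
# GROUP BY region alone is only covered by the 3-dimension cuboid
print(pick_cuboid(["region"], AVAILABLE))           # ('date', 'product', 'region')
```

When no candidate exists (`None`), the query cannot be answered from the Cube — the "query misses Cube" symptom in the error table at the end of this article.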
Real-time OLAP
Kylin 4.0 introduced real-time OLAP support, using Kafka as real-time stream data input to build real-time Cube. Using Lambda architecture, Kylin can support integrated analysis of real-time and batch data.
Cube Typical Application Scenarios
Large-scale Data Analysis
Cube is especially suitable for ultra large-scale datasets at PB level and above. Through its pre-computation mechanism, it can improve complex aggregation query performance by hundreds of times. For example:
- E-commerce platforms can analyze hundreds of millions of user behavior records
- Financial industry can process massive transaction data
- Telecom operators can analyze TB-level call records
Real-time Analysis
Real-time Cube solution supports instant analysis of streaming data:
- Uses Lambda architecture to process batch and stream data simultaneously
- Supports real-time intake from message queues like Kafka
- Typical applications include: real-time monitoring systems, instant marketing effect analysis, real-time fraud detection
Business Intelligence (BI) Tool Integration
Kylin provides comprehensive BI tool integration capabilities:
- Supports standard JDBC/ODBC interfaces
- Deep integration with mainstream BI tools: Tableau, Power BI, Superset
- Retains complete SQL syntax support, including: complex multi-table joins, nested subqueries, window functions and other advanced analysis features
Dimension Table Optimization
- Data consistency is required: primary key values must be unique (otherwise the Kylin build will error)
- Dimension tables should be as small as possible, because Kylin loads them into memory for query use; tables that are too large are not suitable as dimension tables (default threshold 300MB)
- Change frequency should be low: Kylin tries to reuse dimension table snapshots across builds; if a dimension table changes frequently, reuse fails and snapshots must be recreated often
- Dimension tables should preferably not be Hive views, because each build must materialize the view, adding time overhead
Fact Table Optimization
- Remove fields that do not participate in the Cube build; this speeds up the build and reduces the Cube size
- Split dimensions out of the fact table where possible and extract common dimensions
- Ensure the mapping between dimension and fact tables holds, and filter out records that cannot be mapped
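The last point — filtering records that cannot be mapped — can be sketched as a key lookup against the dimension table (field names hypothetical; in production this would be an ETL-stage anti-join):

```python
def filter_mappable(fact_rows, dim_keys):
    """Keep only fact rows whose foreign key exists in the dimension table."""
    keys = set(dim_keys)
    kept = [r for r in fact_rows if r["product_id"] in keys]
    dropped = len(fact_rows) - len(kept)
    return kept, dropped

facts = [{"order_id": "o1", "product_id": "p1"},
         {"order_id": "o2", "product_id": "p9"}]  # p9 missing from dim table
kept, dropped = filter_mappable(facts, ["p1", "p2"])
print(len(kept), dropped)  # 1 1
```

Tracking the dropped count per build is worthwhile: a rising ratio of unmappable rows usually signals an upstream ETL problem rather than a Kylin one.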
Create Cube (By Date)
Core Steps
DataSource => Model => Cube
- Model: describes a star-schema data structure, defining the FactTable, the LookupTables and the relationships between them
- Multiple Cubes can be created from one Model, which reduces duplicate work
Cube Design
- Dimension: date
- Measure: product sales quantity, total sales amount

```sql
SELECT date1, SUM(price), SUM(amount)
FROM dw_sales
GROUP BY date1;
```
Execution Steps
- Create a Project (optional)
- Create a DataSource, specifying which data needs analysis
- Create a Model, specifying the fact table and dimension tables to analyze
- Create a Cube, specifying which data model to pre-process, generating data for the different dimensions
- Execute the build and wait for it to complete
- Execute SQL queries; results are served from the Cube
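The final query step goes through Kylin's SQL interface. A sketch that only builds the JSON body for the REST query endpoint (POST /kylin/api/query); the project name is hypothetical, and connection/auth details are omitted:

```python
import json

def kylin_query_payload(sql, project, limit=100):
    """JSON body for Kylin's REST query endpoint."""
    return json.dumps({"sql": sql, "project": project, "limit": limit})

payload = kylin_query_payload(
    "select date1, sum(price), sum(amount) from dw_sales group by date1",
    project="ecommerce_demo",  # hypothetical project name
)
print(payload)
# To execute, POST this body to <host>/kylin/api/query with JSON content
# type and basic-auth headers, e.g. via the requests library.
```

JDBC/ODBC (see the BI integration section above) is the more common path for BI tools; the REST route is convenient for scripts and smoke tests.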
Error Quick Reference
| Symptom | Root Cause Location | Fix |
|---|---|---|
| Build fails, log shows dimension table conflict | Dimension table primary key not unique | Check duplicate keys in build log; dedupe dimension table to ensure unique primary key; preprocess to dedupe/filter dirty data |
| Build extremely slow/timeout | Selected MR engine with too many Cuboids | Monitor Task stage duration in monitoring page; enable Aggregation Group pruning; increase parallelism and resource queue |
| Dimension table load OOM | Dimension table too large or frequent snapshots | Check Job log memory error; reduce dimension columns, split dimension; raise threshold config or change query approach |
| Query misses the Cube and falls back to slow source scans | Dimensions/measures don't match the query | Check Query Profile/logs to find the missing Cuboid, or adjust the SQL (consistent dimensions/filters) |
| HBase read amplification/hotspot | RowKey design or Region imbalance | Adjust partition and pre-split in HBase UI/RegionServer metrics; optimize encoding and aggregation level |
| Build stuck at Snapshot | Dimension table changes frequently | Observe snapshot step duration; stabilize dimension table update cycle; enable snapshot reuse strategy |
| Kafka real-time latency (if enabled) | Batch size/concurrency improper | Tune batch and concurrency for consumption lag, end-to-end latency; distinguish cold/hot paths (Lambda) |
| JDBC connection failed | Driver/URL/permission error | Check client error code to correct driver and URL; add whitelist and grant read-only account |
| MapReduce task fails | Insufficient resources/over quota | Adjust container memory and concurrency in Yarn logs/retry; bind Job to separate queue |
| Aggregation result abnormal | Fact→dimension mapping missing | Compare foreign key null value ratio; filter unmappable records; improve ETL validation rules |