TL;DR
- Scenario: pre-compute aggregation queries on an e-commerce sales fact table in Kylin, accelerated along the "date" dimension.
- Conclusion: reasonable Cuboid pruning plus dimension/fact table optimization strikes a balance between storage and performance; the build and query pipeline is verified end to end.
- Output: a full walkthrough of Project/DataSource/Model/Cube, a monitoring and error-triage table, and practical SQL examples.
Technique Matrix
| Module | Technical Points | Engineering Suggestions |
|---|---|---|
| Data Source Prep | Hive table design, partition field selection | Keep dimension table primary keys unique and single tables ≤300MB; trim fact table fields; avoid Hive views |
| Dimension Modeling | Star/snowflake model selection, dimension hierarchy definition | Reuse dimension table snapshots; separate the time dimension; keep fact table foreign key mappings simple to avoid complex JOINs |
| Cube Design | Dimension/measure config, Aggregation Group division | Put high-frequency dimensions in a separate group; prune low-cardinality dimensions first |
| Pre-compute Optimization | Cuboid pruning strategies (Mandatory/Hierarchy/Joint) | Use derived dimensions to reduce the Cuboid count; prefer the Spark build engine over MapReduce |
| Storage & Compression | HBase column family design, Snappy/LZO compression selection | Separate cold and hot data; design RowKeys to avoid Region hotspots |
| Incremental Build | Auto-merge strategy, segment build time range control | Partition by hour/day; retain only the minimum necessary historical segments |
| Real-time OLAP (Kylin 4.0) | Kafka stream intake, micro-batch processing window config | Align message partitions with dimension tables; enable Exactly-Once semantics |
| Monitoring & Tuning | Cube build time metrics, query hit rate analysis | Monitor Cuboid utilization with Grafana; optimize queries exceeding 50ms |
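The last row mentions Cuboid utilization and a 50 ms query budget. A minimal sketch of that kind of log post-processing — the log structure and field names here are hypothetical, not a Kylin API:

```python
def cuboid_utilization(query_log, total_cuboids):
    """Fraction of distinct cuboids that served at least one query."""
    hit = {entry["cuboid_id"] for entry in query_log}
    return len(hit) / total_cuboids

def slow_queries(query_log, budget_ms=50):
    """Queries exceeding the latency budget, worst first."""
    return sorted((q for q in query_log if q["latency_ms"] > budget_ms),
                  key=lambda q: q["latency_ms"], reverse=True)

log = [
    {"sql": "q1", "cuboid_id": 3, "latency_ms": 12},
    {"sql": "q2", "cuboid_id": 3, "latency_ms": 75},
    {"sql": "q3", "cuboid_id": 7, "latency_ms": 48},
]
print(cuboid_utilization(log, total_cuboids=8))   # 2 of 8 cuboids used -> 0.25
print([q["sql"] for q in slow_queries(log)])      # ['q2']
```

In practice these numbers would come from Kylin's query metrics and be plotted in Grafana; the point is the two ratios to watch, not the plumbing.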
SQL Example Code Blocks
```sql
-- Dimension table optimization example
CREATE TABLE dim_product (
  product_id    STRING COMMENT 'primary key; values must be unique (Hive does not enforce this)',
  category_path STRING COMMENT 'hierarchy dimension',
  price_bucket  INT    COMMENT 'derived dimension'
) STORED AS ORC;

-- Fact table optimization example
-- Note: the partition column dt is declared in PARTITIONED BY,
-- not in the regular column list.
CREATE TABLE fact_sales (
  order_id   STRING,
  product_id STRING COMMENT 'foreign key mapping to dim_product',
  amount     DECIMAL(18,2)
) PARTITIONED BY (dt DATE) STORED AS PARQUET;
```
Formulas and Parameters
Key parameter calculation formulas:
- Cuboid count estimation: $N = \prod_{i=1}^{n} (L_i + 1)$, where $L_i$ is the number of levels in hierarchy dimension $i$
- Storage optimization threshold: $S_{optimal} = \dfrac{T_{base} \times C_{ratio}}{R_{comp}}$
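Assuming the subscripts denote hierarchy level counts ($L_i$), base table size ($T_{base}$) and compression-related ratios ($C_{ratio}$, $R_{comp}$), the two formulas can be evaluated directly; the example values are hypothetical:

```python
from math import prod

def cuboid_count(levels):
    """N = product of (L_i + 1) over the n hierarchy dimensions."""
    return prod(l + 1 for l in levels)

def storage_threshold(t_base, c_ratio, r_comp):
    """S_optimal = (T_base * C_ratio) / R_comp."""
    return t_base * c_ratio / r_comp

# Three hierarchy dimensions with 2, 3 and 1 levels respectively
print(cuboid_count([2, 3, 1]))             # 3 * 4 * 2 = 24 cuboids
print(storage_threshold(120.0, 0.5, 3.0))  # 20.0
```

Such back-of-the-envelope numbers are useful before a build: if the estimated Cuboid count explodes, pruning (see Aggregation Groups below) is needed before committing cluster resources.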
Cube Introduction
Apache Kylin is an open-source distributed analytics engine focused on providing real-time OLAP (Online Analytical Processing) capabilities over big data. Originally developed by eBay and contributed to the Apache Software Foundation, it has become an important OLAP solution in the Hadoop ecosystem. Kylin integrates with various big data components including HBase, Spark and Kafka, and is especially suited to processing petabyte-scale data.
Cube is one of Apache Kylin's core concepts: it pre-computes multi-dimensional datasets over large-scale data to accelerate complex SQL queries. Cube design borrows from the star and snowflake models of data warehousing, but adds a unique pre-computation mechanism. Specifically, a Cube pre-computes and stores all possible dimension combinations (called Cuboids), so queries can use the pre-computed results directly instead of scanning the original data in real time.
Key points of Cube are detailed below:
- Multi-dimensional Modeling: a Cube is built on fact and dimension tables and supports star and snowflake models. For e-commerce analysis, the fact table can be sales records, with dimension tables for time, product, region, etc.
- Pre-computation Mechanism: Kylin pre-computes the various dimension combinations (Cuboids) during the Cube build. For example, a Cube with 5 dimensions theoretically yields 2^5 = 32 Cuboids, covering all dimension combinations.
- Storage Optimization: Kylin uses HBase as its storage engine, stores pre-computed results in columnar format, and applies efficient encoding and compression. For example, Snappy or LZO compression can be chosen to reduce storage space.
- Incremental Build: incremental Cube builds process only new data without rebuilding the entire Cube. This is especially useful when data arrives daily, such as daily sales report analysis.
- Smart Pruning: techniques like Aggregation Group allow computing only the important Cuboids, balancing query performance against storage cost. For example, certain dimension combinations can be marked mandatory while others are computed on demand.
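The 2^5 = 32 figure and the effect of a mandatory dimension can be reproduced with a short enumeration (dimension names are illustrative):

```python
from itertools import combinations

DIMS = ["date", "product", "region", "channel", "user_level"]

def cuboids(dims, mandatory=()):
    """All dimension combinations; mandatory dims must appear in every cuboid."""
    result = []
    for r in range(len(dims) + 1):
        for combo in combinations(dims, r):
            if all(m in combo for m in mandatory):
                result.append(combo)
    return result

print(len(cuboids(DIMS)))                       # 2^5 = 32
print(len(cuboids(DIMS, mandatory=("date",))))  # halved to 16
```

Each mandatory dimension halves the Cuboid count, which is why marking always-filtered dimensions (like the date partition) as mandatory is usually the first pruning step.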
Cube Basic Concepts
Cube in Kylin is a pre-computed data structure generated by multi-dimensional modeling over fact tables (usually business data tables). A Cube covers combinations of dimensions and measures over multi-dimensional data, so that pre-computed results can be retrieved at query time, significantly reducing computation overhead.
- Dimension: Fields in data used for grouping, filtering and slicing, such as time, region, product, etc.
- Measure: Fields that usually need aggregation calculation, such as sales amount, order count, etc.
- Cuboid: each Cube consists of multiple Cuboids; a Cuboid holds the pre-aggregated results for one specific combination of dimensions. Every dimension combination in the Cube generates its own Cuboid.
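A Cuboid is essentially a stored GROUP BY result. A toy sketch with an in-memory fact table (schema and values hypothetical):

```python
from collections import defaultdict

# Tiny fact table: (date, product, amount)
FACTS = [
    ("2024-01-01", "p1", 10.0),
    ("2024-01-01", "p2", 5.0),
    ("2024-01-02", "p1", 7.0),
]

def build_cuboid(facts, dims):
    """Pre-aggregate SUM(amount) grouped by the given dimension subset."""
    idx = {"date": 0, "product": 1}
    agg = defaultdict(float)
    for row in facts:
        key = tuple(row[idx[d]] for d in dims)
        agg[key] += row[2]
    return dict(agg)

# The ("date",) cuboid answers GROUP BY date queries from stored results
print(build_cuboid(FACTS, ("date",)))
# {('2024-01-01',): 15.0, ('2024-01-02',): 7.0}
```

At query time Kylin looks up such stored results instead of re-running the aggregation over the raw fact table.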
Cube Creation Process
- Data Modeling: first create a Data Model in Kylin; this model defines the relationships between fact and dimension tables, similar to a star or snowflake schema, and also defines the measure fields to aggregate.
- Cube Design: design the Cube based on the Data Model, specifying its dimensions and measures. Kylin automatically derives all possible dimension combinations (Cuboids) from these definitions.
- Build Cube: the build process reads the underlying data sources (such as Hive, HBase or Kafka), then generates pre-computed data for each Cuboid based on the specified dimensions and measures. The pre-computed results are stored in HBase or other storage engines.
Cube Query and Optimization
- Query Acceleration: when a SQL query arrives, Kylin selects the appropriate Cuboid based on the dimension combination involved in the query, avoiding real-time computation and greatly improving query performance.
- Cube Optimization: to control Cube size and speed up builds, Kylin supports pruning the Cube so that only some Cuboids are generated; this is configured through "Aggregation Groups" and reduces redundant computation.
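Cuboid selection can be summarized as "smallest pre-computed cuboid whose dimension set covers the query". This is a simplified model of the routing idea, not Kylin's actual planner:

```python
def pick_cuboid(query_dims, available):
    """Smallest pre-computed cuboid that covers all queried dimensions."""
    q = set(query_dims)
    candidates = [c for c in available if q <= set(c)]
    return min(candidates, key=len) if candidates else None

AVAILABLE = [("date",), ("date", "product"), ("date", "product", "region")]

# GROUP BY date, product hits the exact 2-dimension cuboid
print(pick_cuboid(["date", "product"], AVAILABLE))  # ('date', 'product')
# GROUP BY region alone is only covered by the 3-dimension cuboid
print(pick_cuboid(["region"], AVAILABLE))           # ('date', 'product', 'region')
```

When no candidate exists (`None`), the query cannot be answered from the Cube — the "query misses Cube" symptom in the error table at the end of this article.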
Real-time OLAP
Kylin 4.0 introduced real-time OLAP support, using Kafka as real-time stream data input to build real-time Cube. Using Lambda architecture, Kylin can support integrated analysis of real-time and batch data.
Cube Typical Application Scenarios
Large-scale Data Analysis
Cube is especially suitable for ultra large-scale datasets at PB level and above. Through its pre-computation mechanism, it can improve complex aggregation query performance by hundreds of times. For example:
- E-commerce platforms can analyze hundreds of millions of user behavior records
- Financial industry can process massive transaction data
- Telecom operators can analyze TB-level call records
Real-time Analysis
Real-time Cube solution supports instant analysis of streaming data:
- Uses Lambda architecture to process batch and stream data simultaneously
- Supports real-time intake from message queues like Kafka
- Typical applications include: real-time monitoring systems, instant marketing effect analysis, real-time fraud detection
Business Intelligence (BI) Tool Integration
Kylin provides comprehensive BI tool integration capabilities:
- Supports standard JDBC/ODBC interfaces
- Deep integration with mainstream BI tools: Tableau, Power BI, Superset
- Retains complete SQL syntax support, including: complex multi-table joins, nested subqueries, window functions and other advanced analysis features
Dimension Table Optimization
- Data consistency is required: primary key values must be unique (otherwise the Kylin build will error)
- Dimension tables should be as small as possible, because Kylin loads them into memory for query use; tables that are too large are not suitable as dimension tables (default threshold 300MB)
- Change frequency should be low: Kylin tries to reuse dimension table snapshots across builds; if a dimension table changes frequently, reuse fails and snapshots must be recreated often
- Dimension tables should preferably not be Hive views, because each build must materialize the view, adding time overhead
Fact Table Optimization
- Remove fields that do not participate in the Cube build; this speeds up the build and reduces the Cube size
- Split dimensions out of the fact table where possible and extract common dimensions
- Ensure the mapping between dimension and fact tables holds, and filter out records that cannot be mapped
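The last point — filtering records that cannot be mapped — can be sketched as a key lookup against the dimension table (field names hypothetical; in production this would be an ETL-stage anti-join):

```python
def filter_mappable(fact_rows, dim_keys):
    """Keep only fact rows whose foreign key exists in the dimension table."""
    keys = set(dim_keys)
    kept = [r for r in fact_rows if r["product_id"] in keys]
    dropped = len(fact_rows) - len(kept)
    return kept, dropped

facts = [{"order_id": "o1", "product_id": "p1"},
         {"order_id": "o2", "product_id": "p9"}]  # p9 missing from dim table
kept, dropped = filter_mappable(facts, ["p1", "p2"])
print(len(kept), dropped)  # 1 1
```

Tracking the dropped count per build is worthwhile: a rising ratio of unmappable rows usually signals an upstream ETL problem rather than a Kylin one.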
Create Cube (By Date)
Core Steps
DataSource => Model => Cube
- Model: describes a star-schema data structure, defining the FactTable, the LookupTables and the relationships between them
- Multiple Cubes can be created from one Model, which reduces duplicate work
Cube Design
- Dimension: date
- Measure: product sales quantity, total sales amount

```sql
SELECT date1, SUM(price), SUM(amount)
FROM dw_sales
GROUP BY date1;
```
Execution Steps
- Create a Project (optional)
- Create a DataSource, specifying which data needs analysis
- Create a Model, specifying the fact table and dimension tables to analyze
- Create a Cube, specifying which data model to pre-process, generating data for the different dimensions
- Execute the build and wait for it to complete
- Execute SQL queries; results are served from the Cube
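The final query step goes through Kylin's SQL interface. A sketch that only builds the JSON body for the REST query endpoint (POST /kylin/api/query); the project name is hypothetical, and connection/auth details are omitted:

```python
import json

def kylin_query_payload(sql, project, limit=100):
    """JSON body for Kylin's REST query endpoint."""
    return json.dumps({"sql": sql, "project": project, "limit": limit})

payload = kylin_query_payload(
    "select date1, sum(price), sum(amount) from dw_sales group by date1",
    project="ecommerce_demo",  # hypothetical project name
)
print(payload)
# To execute, POST this body to <host>/kylin/api/query with JSON content
# type and basic-auth headers, e.g. via the requests library.
```

JDBC/ODBC (see the BI integration section above) is the more common path for BI tools; the REST route is convenient for scripts and smoke tests.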
Error Quick Reference
| Symptom | Root Cause Location | Fix |
|---|---|---|
| Build fails, log shows dimension table conflict | Dimension table primary key not unique | Check duplicate keys in build log; dedupe dimension table to ensure unique primary key; preprocess to dedupe/filter dirty data |
| Build extremely slow/timeout | Selected MR engine with too many Cuboids | Monitor Task stage duration in monitoring page; enable Aggregation Group pruning; increase parallelism and resource queue |
| Dimension table load OOM | Dimension table too large or frequent snapshots | Check Job log memory error; reduce dimension columns, split dimension; raise threshold config or change query approach |
| Query misses the Cube and falls back to slow source scans | Dimensions/measures don't match the query | Check Query Profile/logs to find the missing Cuboid, or adjust the SQL (consistent dimensions/filters) |
| HBase read amplification/hotspot | RowKey design or Region imbalance | Adjust partition and pre-split in HBase UI/RegionServer metrics; optimize encoding and aggregation level |
| Build stuck at Snapshot | Dimension table changes frequently | Observe snapshot step duration; stabilize dimension table update cycle; enable snapshot reuse strategy |
| Kafka real-time latency (if enabled) | Batch size/concurrency improper | Tune batch and concurrency for consumption lag, end-to-end latency; distinguish cold/hot paths (Lambda) |
| JDBC connection failed | Driver/URL/permission error | Check client error code to correct driver and URL; add whitelist and grant read-only account |
| MapReduce task fails | Insufficient resources/over quota | Adjust container memory and concurrency in Yarn logs/retry; bind Job to separate queue |
| Aggregation result abnormal | Fact→dimension mapping missing | Compare foreign key null value ratio; filter unmappable records; improve ETL validation rules |