Cube Introduction
Apache Kylin is an open-source distributed analytics engine focused on providing OLAP (Online Analytical Processing) capability on big data with sub-second query latency. The Cube is one of Apache Kylin's core concepts: it pre-computes multi-dimensional aggregates over large-scale data to accelerate complex SQL queries.
Cube Basic Concepts
A Cube in Kylin is a pre-computed data structure produced by multi-dimensional modeling over a fact table (usually a business data table) and its dimension tables. A Cube materializes combinations of dimensions and measures, so pre-computed results can be fetched at query time, significantly reducing computation overhead.
- Dimension: Fields used to group, filter, and slice data, such as time, region, or product.
- Measure: Fields that are aggregated, such as sales amount or order count.
- Cuboid: The pre-aggregated result for one specific combination of dimensions. Every dimension combination in a Cube generates a Cuboid, so a Cube consists of multiple Cuboids, each storing the pre-aggregated results for its combination.
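The dimension-to-Cuboid relationship can be sketched in a few lines: a Cube with n dimensions has up to 2^n Cuboids, one per subset of the dimensions. The dimension names below are taken from the example model later in this article:

```python
from itertools import combinations

def enumerate_cuboids(dimensions):
    """Return every subset of the dimensions; each subset corresponds to one Cuboid."""
    cuboids = []
    for r in range(len(dimensions) + 1):
        for combo in combinations(dimensions, r):
            cuboids.append(combo)
    return cuboids

dims = ['date1', 'channelId', 'productId', 'regionId']
cuboids = enumerate_cuboids(dims)
print(len(cuboids))  # 2^4 = 16 Cuboids, including the empty (grand total) one
```

This exponential growth in Cuboids is exactly why the pruning techniques described below matter.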
Cube Creation Process
- Data Modeling: First create a Data Model in Kylin. The model defines the relationships between the fact table and its dimension tables (a star or snowflake schema), as well as the measure fields to aggregate.
- Cube Design: Design a Cube based on the Data Model, specifying its dimensions and measures. Kylin then derives all candidate dimension combinations (Cuboids) from this definition.
- Build Cube: The build process reads the underlying data sources (such as Hive or Kafka), then generates pre-computed data for each Cuboid according to the specified dimensions and measures. The pre-computed results are stored in HBase or another storage engine.
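Conceptually, building one Cuboid means grouping the fact rows by that Cuboid's dimension combination and storing the aggregates. A minimal sketch with SUM(amount) as the only measure; the rows and field positions are illustrative, not Kylin's actual build code:

```python
from collections import defaultdict

# Illustrative fact rows: (channelId, productId, amount)
fact_rows = [
    ('C001', 'P001', 2),
    ('C001', 'P002', 5),
    ('C002', 'P001', 3),
]

def build_cuboid(rows, dims):
    """Pre-aggregate SUM(amount) for one dimension combination (one Cuboid)."""
    index = {'channelId': 0, 'productId': 1}
    agg = defaultdict(int)
    for row in rows:
        key = tuple(row[index[d]] for d in dims)
        agg[key] += row[2]  # amount
    return dict(agg)

print(build_cuboid(fact_rows, ['channelId']))  # {('C001',): 7, ('C002',): 3}
```

A real build runs this grouping for every selected Cuboid, which is why build cost grows with the number of dimension combinations.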
Cube Query and Optimization
- Query Acceleration: When a SQL query arrives, Kylin selects the appropriate Cuboid based on the dimension combination involved in the query and returns its pre-computed result, avoiding real-time computation and greatly improving query performance.
- Cube Optimization: To control Cube size and speed up builds, Kylin supports Cube pruning: an "Aggregation Group" configures which Cuboids to generate, so only part of the dimension combinations are materialized and redundant computation is reduced.
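The effect of pruning can be approximated with a toy model: marking a dimension as mandatory keeps only the Cuboids that contain it, halving the Cuboid count per mandatory dimension. This sketch illustrates the idea only; it is not Kylin's Aggregation Group API:

```python
from itertools import combinations

def all_cuboids(dims):
    """Every dimension subset, as a list of sets."""
    return [set(c) for r in range(len(dims) + 1) for c in combinations(dims, r)]

def prune(cuboids, mandatory):
    """Keep only the Cuboids that contain every mandatory dimension."""
    return [c for c in cuboids if mandatory <= c]

dims = ['date1', 'channelId', 'productId', 'regionId']
full = all_cuboids(dims)          # 16 Cuboids for 4 dimensions
pruned = prune(full, {'date1'})   # date1 mandatory -> only 8 remain
print(len(full), len(pruned))     # 16 8
```

Hierarchy and joint dimensions in real Aggregation Groups prune further along the same lines: they declare combinations that can never be queried alone, so their Cuboids are never built.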
Real-time OLAP
Kylin 3.0 introduced real-time OLAP support, using Kafka as the input for real-time data streams to build real-time Cubes. With a Lambda architecture, Kylin can analyze real-time and batch data together.
Typical Cube Application Scenarios
- Large-scale Data Analysis: Cubes are well suited to analyzing ultra-large datasets, accelerating queries through pre-computation.
- Real-time Analysis: Real-time Cubes let users analyze streaming data in near real time.
- Business Intelligence (BI) Tool Integration: Kylin integrates with common BI tools such as Tableau and Power BI, so users can run complex multi-dimensional analysis with familiar SQL.
Prerequisites
You need a configured and running Kylin instance.
Prepare Data
Data Files
- dw_sales_data.txt
- dim_channel_data.txt
- dim_product_data.txt
- dim_region_data.txt
Data Generation Script (Python)
dw_sales_data:
```python
import random
import datetime

# Set parameters
num_records = 1000
output_file = 'dw_sales_data.txt'

# Define possible values
channel_ids = ['C001', 'C002', 'C003', 'C004']
product_ids = ['P001', 'P002', 'P003', 'P004']
region_ids = ['R001', 'R002', 'R003', 'R004']
base_date = datetime.date(2024, 1, 1)

# Generate data
with open(output_file, 'w') as f:
    for i in range(num_records):
        record_id = f"{i+1:04d}"
        date1 = (base_date + datetime.timedelta(days=random.randint(0, 365))).strftime('%Y-%m-%d')
        channel_id = random.choice(channel_ids)
        product_id = random.choice(product_ids)
        region_id = random.choice(region_ids)
        amount = random.randint(1, 100)
        price = round(random.uniform(10.0, 500.0), 2)
        line = f"{record_id},{date1},{channel_id},{product_id},{region_id},{amount},{price}\n"
        f.write(line)

print(f"{num_records} records have been written to {output_file}")
```
dim_channel_data:
```python
output_file = 'dim_channel_data.txt'

channels = [
    ('C001', 'Online Sales'),
    ('C002', 'Retail Store'),
    ('C003', 'Wholesale'),
    ('C004', 'Direct Sales')
]

with open(output_file, 'w') as f:
    for channel_id, channel_name in channels:
        line = f"{channel_id},{channel_name}\n"
        f.write(line)

print(f"Channel data has been written to {output_file}")
```
dim_product_data:
```python
output_file = 'dim_product_data.txt'

products = [
    ('P001', 'Smartphone'),
    ('P002', 'Laptop'),
    ('P003', 'Tablet'),
    ('P004', 'Smartwatch'),
    ('P005', 'Camera'),
    ('P006', 'Headphones'),
    ('P007', 'Monitor'),
    ('P008', 'Keyboard'),
    ('P009', 'Mouse'),
    ('P010', 'Printer')
]

with open(output_file, 'w') as f:
    for product_id, product_name in products:
        line = f"{product_id},{product_name}\n"
        f.write(line)

print(f"Product data has been written to {output_file}")
```
dim_region_data:
```python
output_file = 'dim_region_data.txt'

regions = [
    ('R001', 'North America'),
    ('R002', 'Europe'),
    ('R003', 'Asia'),
    ('R004', 'South America'),
    ('R005', 'Africa'),
    ('R006', 'Australia'),
    ('R007', 'Antarctica')
]

with open(output_file, 'w') as f:
    for region_id, region_name in regions:
        line = f"{region_id},{region_name}\n"
        f.write(line)

print(f"Region data has been written to {output_file}")
```
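Before loading the files into Hive, it is worth verifying that each generated row matches the table schema used in the next section (7 comma-separated fields, with `amount` an int and `price` a double). The sample line below is made up for illustration:

```python
def parse_sales_line(line):
    """Split one dw_sales row and coerce the numeric fields, mirroring the Hive schema."""
    fields = line.strip().split(',')
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}")
    record_id, date1, channel_id, product_id, region_id, amount, price = fields
    return {
        'id': record_id,
        'date1': date1,
        'channelId': channel_id,
        'productId': product_id,
        'regionId': region_id,
        'amount': int(amount),   # Hive column: amount int
        'price': float(price),   # Hive column: price double
    }

row = parse_sales_line('0001,2024-03-15,C001,P002,R003,42,199.99\n')
print(row['amount'], row['price'])  # 42 199.99
```

A mismatch here (extra commas, swapped columns) would load silently as NULLs in Hive, so catching it early saves a rebuild.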
Hive Create Table & Data Load SQL
```sql
-- Create the order database and table structures
create database if not exists `wzk_kylin`;

-- 1. Sales fact table: dw_sales
create table wzk_kylin.dw_sales(
  id string,
  date1 string,
  channelId string,
  productId string,
  regionId string,
  amount int,
  price double
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- 2. Channel dimension table: dim_channel
create table wzk_kylin.dim_channel(
  channelId string,
  channelName string
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- 3. Product dimension table: dim_product
create table wzk_kylin.dim_product(
  productId string,
  productName string
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- 4. Region dimension table: dim_region
create table wzk_kylin.dim_region(
  regionId string,
  regionName string
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Load data
LOAD DATA LOCAL INPATH '/opt/wzk/kylin_test/dw_sales_data.txt'
OVERWRITE INTO TABLE wzk_kylin.dw_sales;
LOAD DATA LOCAL INPATH '/opt/wzk/kylin_test/dim_channel_data.txt'
OVERWRITE INTO TABLE wzk_kylin.dim_channel;
LOAD DATA LOCAL INPATH '/opt/wzk/kylin_test/dim_product_data.txt'
OVERWRITE INTO TABLE wzk_kylin.dim_product;
LOAD DATA LOCAL INPATH '/opt/wzk/kylin_test/dim_region_data.txt'
OVERWRITE INTO TABLE wzk_kylin.dim_region;
```
Execute Data Load
```shell
cd /opt/wzk/kylin_test
hive -f kylin_examples.sql
```
Test Data
```shell
hive
```

```sql
use wzk_kylin;
select date1, sum(price) as total_money, sum(amount) as total_amount
from dw_sales
group by date1;
```
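The Hive query above can be cross-checked in plain Python on a handful of rows; the sketch below applies the same GROUP BY logic. The sample rows are illustrative, not taken from the generated file:

```python
from collections import defaultdict

# Illustrative rows: (date1, amount, price)
rows = [
    ('2024-01-01', 2, 100.0),
    ('2024-01-01', 3, 50.5),
    ('2024-01-02', 1, 200.0),
]

# date1 -> [total_money, total_amount], matching SUM(price) and SUM(amount)
totals = defaultdict(lambda: [0.0, 0])
for date1, amount, price in rows:
    totals[date1][0] += price
    totals[date1][1] += amount

for date1 in sorted(totals):
    total_money, total_amount = totals[date1]
    print(date1, total_money, total_amount)
```

Once a Cube covering `date1` with these two measures is built, Kylin answers this query from the matching Cuboid instead of scanning dw_sales.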
Error Quick Reference
| Symptom | Likely Cause | Fix |
|---|---|---|
| `Table not found wzk_kylin.dw_sales` | Database/table not created, or case mismatch | Re-run the create-table script; confirm the database prefix and case are consistent |
| `LOAD DATA LOCAL` fails or table is empty | Wrong local path, insufficient permissions, or separator mismatch | Fix the path and permissions; confirm the field separator and column order match the table definition |
| Kylin build job stuck or failed | HDFS/YARN queue or permission problems, or MapReduce parameters | Increase queue resources; tune build parallelism and memory in Kylin |
| Queries are slow (pre-computation not used) | Poor dimension/measure or Aggregation Group design | Simplify dimension combinations and prune Cuboids sensibly; create a separate Aggregation Group for high-frequency queries |
| "No realization found" for a query | The Cube does not cover that dimension combination or time partition | Expand the dimension combinations, or build the missing Segment for that partition |
| High real-time ingestion latency | Kafka topic/schema does not match the model | Align the schema; increase batch/cache sizes; verify timestamps and watermarks |
| `RegionTooBusy`/timeout (HBase) | Region hot spot or memory parameters too small | Pre-split regions and separate hot/cold data; increase MemStore/BlockCache |
| Permission errors (Kerberos/ACL) | Security authentication is enabled on the cluster | Check the keytab and principal; acquire tickets and configure the Kylin/Hive clients |