TL;DR
- Scenario: E-commerce sales data multi-dimensional analysis, Cube modeling and pre-computation by date/region/product/channel
- Conclusion: Star schema modeling + Aggregation Group pruning sharply reduces the Cuboid count; queries respond at second level or faster
- Output: Complete modeling flow, dimension design principles, Cuboid optimization strategies, error quick reference table
Key Points Matrix
| Module | Technical Points | Engineering Suggestions |
|---|---|---|
| Star Modeling | Fact table + dimension table, single table ≤300MB | Dimension table primary key unique, avoid Hive views |
| Dimension Design | Date/region/product/channel | High-frequency dimensions first, hierarchy dimensions handle separately |
| Measure Design | SUM/COUNT/AVG/MAX/MIN | Count Distinct needs cardinality evaluation |
| Aggregation Group | Dimension grouping, combination control | Split into multiple groups when dimension >15 |
| Mandatory Dimension | Must appear in WHERE/GROUP BY | Reduce invalid Cuboids |
| Hierarchy Dimension | Country→Province→City | Prune combinations that skip a level (e.g. city without province) |
| Joint Dimension | Low cardinality/often queried together | All members appear together or none appear |
| Build Engine | Hive + Spark + HBase | Prefer Spark build over MapReduce |
| Query Acceleration | Cuboid matching | Ensure query hits pre-computed results |
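The pruning rules in the matrix translate into simple arithmetic. Below is a minimal sketch of the combinatorics (simplified Kylin semantics; the function name and the 7-dimension example are illustrative, not from a real Cube):

```python
from math import prod

def cuboid_count(normal=0, mandatory=0, hierarchies=(), joints=()):
    """Estimate the cuboids in one aggregation group under Kylin-style pruning.

    - a mandatory dim appears in every cuboid    -> factor 1 each
    - a hierarchy of h levels allows h+1 states  -> factor h+1
    - a joint of k dims is all-or-nothing        -> factor 2
    - each normal dim is freely on/off           -> factor 2
    """
    factors = [2] * normal + [h + 1 for h in hierarchies] + [2] * len(joints)
    return prod(factors)  # empty product is 1 (only the base combination)

# 7 dims: dt, province, city, channel_name, channel_type, category, brand
full = 2 ** 7                              # un-pruned cube: 128 cuboids
pruned = cuboid_count(normal=2,            # category, brand
                      mandatory=1,         # dt
                      hierarchies=(2,),    # province -> city (2 levels)
                      joints=(2,))         # channel_name + channel_type
print(full, pruned)  # 128 24
```

Even on this small model the pruned count is roughly a fifth of the full cube; the gap widens exponentially as dimensions are added.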
Modeling Flow
1. Data Preparation
-- Fact table: E-commerce sales records
-- In Hive, the partition column is declared only in PARTITIONED BY, not in the column list
CREATE TABLE fact_sales (
order_id STRING,
region_id STRING COMMENT 'region foreign key',
product_id STRING COMMENT 'product foreign key',
channel_id STRING COMMENT 'channel foreign key',
amount DECIMAL(18,2) COMMENT 'sales amount',
quantity INT COMMENT 'sales quantity'
) PARTITIONED BY (dt DATE COMMENT 'partition field') STORED AS PARQUET;
-- Dimension table: Region
CREATE TABLE dim_region (
region_id STRING COMMENT 'primary key, must be unique',
region_name STRING,
province STRING,
city STRING
) STORED AS ORC;
-- Dimension table: Product
CREATE TABLE dim_product (
product_id STRING COMMENT 'primary key, must be unique',
product_name STRING,
category STRING,
brand STRING
) STORED AS ORC;
-- Dimension table: Channel
CREATE TABLE dim_channel (
channel_id STRING COMMENT 'primary key, must be unique',
channel_name STRING,
channel_type STRING
) STORED AS ORC;
2. Model Definition
Create Model in Kylin Web UI:
- Fact table: fact_sales
- Dimension tables: dim_region, dim_product, dim_channel
- Partition field: dt (date)
- Measures: SUM(amount), SUM(quantity), COUNT(*)
3. Cube Design
Dimension Config
| Dimension Type | Field | Config |
|---|---|---|
| Mandatory | dt | Mandatory dimension; present in every query |
| Hierarchy | dim_region.province, dim_region.city | Hierarchy dimension (province → city) |
| Joint | dim_channel.channel_name, dim_channel.channel_type | Joint dimension (queried together) |
| Normal | dim_product.category, dim_product.brand | Normal dimension |
Aggregation Group
- Group 1: dt, region_id (high-frequency combination)
- Group 2: dt, product_id (product analysis)
- Group 3: dt, channel_id (channel analysis)
Cube Build
Full Build
# Build via Kylin Web UI or REST API
# startTime/endTime are epoch milliseconds; leave both 0 for a full build
curl -X PUT --user ADMIN:KYLIN \
-H "Content-Type: application/json;charset=utf-8" \
-d '{"startTime": 0, "endTime": 0, "buildType": "BUILD"}' \
http://h122.wzk.icu:7070/kylin/api/cubes/ecommerce_cube/build
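For an incremental build the same endpoint takes a time range in epoch milliseconds. A sketch of the payload construction in Python (the `build_payload` helper is illustrative; the HTTP call itself is left to urllib/curl):

```python
import json
from datetime import datetime, timezone

def build_payload(start_date: str, end_date: str) -> str:
    """Kylin's REST build API expects epoch-millisecond timestamps."""
    def millis(d: str) -> int:
        dt = datetime.strptime(d, "%Y-%m-%d").replace(tzinfo=timezone.utc)
        return int(dt.timestamp() * 1000)
    return json.dumps({"startTime": millis(start_date),
                       "endTime": millis(end_date),
                       "buildType": "BUILD"})

payload = build_payload("2024-01-01", "2024-02-01")
print(payload)
# Submit with a PUT to http://<host>:7070/kylin/api/cubes/ecommerce_cube/build
# (basic auth), e.g. via urllib.request or the curl command above.
```

Passing human-readable dates instead of raw millisecond longs avoids the most common off-by-a-timezone build-range mistake.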
Build Flow
- Step 1: Extract Fact Table (read flat data from the Hive fact table)
- Step 2: Build Dimension Dictionary (encode dimension values as dictionaries)
- Step 3: Build Cube (compute every Cuboid in the aggregation groups)
- Step 4: Convert Cube to HBase (bulk-load Cuboid results into HBase)
- Step 5: Checkpoint (persist Cube metadata)
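Each submitted build surfaces as a job that moves through these steps, so automation usually polls the job status until it reaches a terminal state. A sketch with a pluggable `fetch_status` callable (the helper is hypothetical; the status names follow Kylin's job states):

```python
import time

TERMINAL = {"FINISHED", "ERROR", "DISCARDED"}

def wait_for_job(fetch_status, interval_s=1.0, timeout_s=3600.0):
    """Poll fetch_status() until the build job reaches a terminal state."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = fetch_status()   # e.g. wraps GET /kylin/api/jobs/{job_id}
        if status in TERMINAL:
            return status
        time.sleep(interval_s)
    raise TimeoutError("build job did not finish in time")

# Simulated run: job goes RUNNING -> RUNNING -> FINISHED
states = iter(["RUNNING", "RUNNING", "FINISHED"])
print(wait_for_job(lambda: next(states), interval_s=0.01))  # FINISHED
```

Injecting the fetch function keeps the polling logic testable without a live Kylin cluster.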
Query Examples
SQL Queries
-- Sales by date + region
SELECT dt, region_name, SUM(amount) AS total_amount
FROM fact_sales
JOIN dim_region ON fact_sales.region_id = dim_region.region_id
WHERE dt >= '2024-01-01' AND dt <= '2024-01-31'
GROUP BY dt, region_name;
-- Monthly summary by product category
SELECT SUBSTR(dt, 1, 7) AS month, category,
SUM(amount) AS total, COUNT(*) AS orders
FROM fact_sales
JOIN dim_product ON fact_sales.product_id = dim_product.product_id
GROUP BY SUBSTR(dt, 1, 7), category
ORDER BY month, total DESC;
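The same SQL can also be submitted programmatically through Kylin's REST query endpoint (POST /kylin/api/query). A sketch that builds the request body and flattens a response into row dicts; the sample response literal below is illustrative, not captured from a real cluster:

```python
import json

def query_body(sql: str, project: str, limit: int = 1000) -> str:
    """JSON body for Kylin's REST query endpoint."""
    return json.dumps({"sql": sql, "project": project, "limit": limit})

def rows_as_dicts(response: dict) -> list[dict]:
    """Pair each result row with the column names from columnMetas."""
    names = [m["name"] for m in response["columnMetas"]]
    return [dict(zip(names, row)) for row in response["results"]]

# Illustrative response shape (trimmed to the two relevant fields)
sample = {
    "columnMetas": [{"name": "DT"}, {"name": "TOTAL_AMOUNT"}],
    "results": [["2024-01-01", "1250.00"], ["2024-01-02", "980.50"]],
}
print(rows_as_dicts(sample)[0])  # {'DT': '2024-01-01', 'TOTAL_AMOUNT': '1250.00'}
```

Note that Kylin returns cell values as strings; downstream code should cast amounts explicitly.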
Query Hit Analysis
View via the Kylin Query page:
- Hit Cuboid: which pre-computed Cuboid answered the query
- Query Time: whether the query completed at millisecond level
- Scan Rows: number of rows scanned in HBase
Real-time OLAP (Kafka)
Kylin 4.0 supports real-time streaming ingestion from Kafka:
- Create Streaming Cube: Select Kafka as data source
- Configure message format: JSON/Avro
- Set time window: Incremental build interval (minute level)
- Query: Same syntax as offline Cube query
Sample Kafka message:
{
"dimensions": {"region": "APAC", "product": "phone"},
"metrics": {"revenue": 12500, "orders": 42},
"timestamp": "2024-01-15T10:30:00Z"
}
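For the minute-level incremental builds mentioned above, each event is assigned to a build window derived from its timestamp. A minimal sketch of the windowing arithmetic (the `interval_min` parameter and function name are assumptions, not Kylin configuration keys):

```python
from datetime import datetime

def window_start(ts: str, interval_min: int = 5) -> str:
    """Floor an event timestamp to the start of its incremental-build window."""
    t = datetime.fromisoformat(ts.replace("Z", "+00:00"))
    floored = t.replace(minute=t.minute - t.minute % interval_min,
                        second=0, microsecond=0)
    return floored.strftime("%Y-%m-%dT%H:%M:%SZ")

print(window_start("2024-01-15T10:30:00Z"))   # 2024-01-15T10:30:00Z
print(window_start("2024-01-15T10:33:07Z"))   # 2024-01-15T10:30:00Z
```

Events arriving within the same window are aggregated into one micro-segment, which is what keeps streaming query syntax identical to the offline Cube.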
Error Quick Reference
| Symptom | Root Cause | Fix |
|---|---|---|
| Build fails, dimension table primary key duplicate | Dimension table has duplicate primary keys | Preprocess dedup, ensure primary key unique |
| Build extremely slow, too many Cuboids | Aggregation Group not configured | Split aggregation groups, configure mandatory/hierarchy dimensions |
| Query misses, falls back to Hive | Query dimension combination not pre-computed | Add Cuboid or adjust query |
| HBase storage expansion >1000% | High cardinality dimensions + no pruning | Use dictionary encoding, optimize dimension combinations |
| Real-time build latency high | Kafka consumption lag | Adjust batch processing parameters, increase concurrency |
| Dimension table OOM | Dimension table too large (>300MB) | Split dimension table, reduce column count |
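For the first row in the table, a quick pre-build check catches duplicate dimension keys before Kylin's dictionary build fails. A sketch over in-memory rows (reading the extract from Hive or CSV is left out; the sample data is illustrative):

```python
from collections import Counter

def duplicate_keys(rows, key):
    """Return key values that appear more than once in a dimension extract."""
    counts = Counter(r[key] for r in rows)
    return sorted(k for k, c in counts.items() if c > 1)

dim_region = [
    {"region_id": "R1", "region_name": "North"},
    {"region_id": "R2", "region_name": "South"},
    {"region_id": "R1", "region_name": "North (dup)"},
]
print(duplicate_keys(dim_region, "region_id"))  # ['R1']
```

Running such a check in the ETL step that loads dim_region/dim_product/dim_channel is cheaper than discovering the duplicates mid-build.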