TL;DR
- Scenario: Cube has many dimensions, Cuboid exponential growth causes slow build and storage expansion
- Conclusion: Through derived dimensions + aggregation groups + expansion rate monitoring, can effectively control Cuboid count
- Output: Cuboid inspection tool, derived dimension config, expansion rate analysis, error quick reference
Cuboid Pruning Overview
What is Cuboid
Cuboid specifically refers to all data computed under one dimension combination in Kylin. Optimization aimed at reducing Cuboid count is collectively called Cuboid pruning.
Cuboid Count Calculation
4 dimensions: 2^4 = 16 Cuboids
10 dimensions: 2^10 = 1024 Cuboids
20 dimensions: 2^20 = 1048576 Cuboids
Pruning Principles
- Skip Cuboids that will never be queried
- Skip Cuboids with capabilities close to other Cuboids
- Prerequisite: Don’t significantly affect query time
Check Cuboid Count
CubeStatsReader Tool
# Enter Kylin installation directory
cd $KYLIN_HOME
# Execute Cuboid statistics
kylin.sh org.apache.kylin.engine.mr.common.CubeStatsReader wzk_kylin_test_cube_4
Output Results
- Estimated Cuboid size precision (HII Precision)
- Total Cuboid count
- Estimated total rows in Segment
- Estimated Segment size
- All Cuboids and analysis results
Check Cube Size (Web GUI)
Select a Cube in READY status on Model page:
- Cube Size: Actual Cube size
- Expansion Rate: Expansion rate = Cube size / source data size
Expansion Rate Reference:
- 0% ~ 1000%: Normal range
- > 1000%: Needs optimization
Derived Dimension
Principle
Set non-primary key fields of dimension table as derived dimensions:
- Don’t participate in Cuboid computation
- Use dimension table primary key instead
- Dynamically translate at query time
Applicable Scenarios
- Dimension table fields have clear mapping with primary key
- Not much aggregation work
Not Applicable Scenarios
- Primary key to dimension requires heavy aggregation
- Affects query performance
Create Derived Dimension
In Kylin Cube Designer:
- Select dimension table
- Find column to set as derived
- Check Derived
- Save Cube
Case Comparison
Date dimension table dim_date:
| Field | Description |
|---|---|
| dateid | Primary key |
| dayofyear | Day of year |
| dayofmonth | Day of month |
| day_in_year | Day of year (alternative) |
| day_in_month | Day of month (alternative) |
| weekday | Day of week |
| week_in_month | Week of month |
| week_in_year | Week of year |
| date_type | Date type |
| quarter | Quarter |
Note: Date dimension fields cannot be year, month, etc., need to be fields in dim_date table.
Comparison Results
| Cube Type | Cuboid Count | Build Time | Storage Size |
|---|---|---|---|
| Normal dimension | More | Longer | Larger |
| Derived dimension | Significantly less | Shorter | Smaller |
Reasons for High Expansion Rate
- Many dimensions, no Cuboid pruning
- High cardinality dimensions exist (like user ID, order ID)
- Space-consuming measures exist (like Count Distinct)
Error Quick Reference
| Symptom | Root Cause Location | Fix |
|---|---|---|
| Expansion rate > 1000% | Too many Cuboids/high cardinality dimensions/heavy Count Distinct | Check GUI size and expansion rate; use StatsReader to view size/row distribution; disable unhit/approximate Cuboids |
| Build extremely slow/fails | Dimension combination explosion/resources and splits mismatch | Check build Job logs; prune strongly based on Segment size and Mapper/Reducer count |
| StatsReader no output | Tool classpath and version mismatch | Execute in $KYLIN_HOME; verify class name in distribution package |
| Query becomes slow (after enabling Derived) | Query-side secondary aggregation cost increases | Check query plan/Profile; change high-frequency dimensions back to normal dimensions |
| Some Cuboids never hit | Design doesn’t match query patterns | Check query logs/hit statistics; disable that combination in Cube design |