Apache Kylin Cuboid Pruning Practice: Derived Dimensions ...

TL;DR

Scenario: Cube has many dimensions, Cuboid exponential growth causes slow build and storage expansion
Conclusion: Through derived dimensions + aggregation groups + expansion rate monitoring, can effectively control Cuboid count
Output: Cuboid inspection tool, derived dimension config, expansion rate analysis, error quick reference

Cuboid Pruning Overview

What is Cuboid

Cuboid specifically refers to all data computed under one dimension combination in Kylin. Optimization aimed at reducing Cuboid count is collectively called Cuboid pruning.

Cuboid Count Calculation

4 dimensions: 2^4 = 16 Cuboids
10 dimensions: 2^10 = 1024 Cuboids
20 dimensions: 2^20 = 1048576 Cuboids

Pruning Principles

Skip Cuboids that will never be queried
Skip Cuboids with capabilities close to other Cuboids
Prerequisite: Don’t significantly affect query time

Check Cuboid Count

CubeStatsReader Tool

# Enter Kylin installation directory
cd $KYLIN_HOME

# Execute Cuboid statistics
kylin.sh org.apache.kylin.engine.mr.common.CubeStatsReader wzk_kylin_test_cube_4

Output Results

Estimated Cuboid size precision (HII Precision)
Total Cuboid count
Estimated total rows in Segment
Estimated Segment size
All Cuboids and analysis results

Check Cube Size (Web GUI)

Select a Cube in READY status on Model page:

Cube Size: Actual Cube size
Expansion Rate: Expansion rate = Cube size / source data size

Expansion Rate Reference:

0% ~ 1000%: Normal range
> 1000%: Needs optimization

Derived Dimension

Principle

Set non-primary key fields of dimension table as derived dimensions:

Don’t participate in Cuboid computation
Use dimension table primary key instead
Dynamically translate at query time

Applicable Scenarios

Dimension table fields have clear mapping with primary key
Not much aggregation work

Not Applicable Scenarios

Primary key to dimension requires heavy aggregation
Affects query performance

Create Derived Dimension

In Kylin Cube Designer:

Select dimension table
Find column to set as derived
Check Derived
Save Cube

Case Comparison

Date dimension table dim_date:

Field	Description
dateid	Primary key
dayofyear	Day of year
dayofmonth	Day of month
day_in_year	Day of year (alternative)
day_in_month	Day of month (alternative)
weekday	Day of week
week_in_month	Week of month
week_in_year	Week of year
date_type	Date type
quarter	Quarter

Note: Date dimension fields cannot be year, month, etc., need to be fields in dim_date table.

Comparison Results

Cube Type	Cuboid Count	Build Time	Storage Size
Normal dimension	More	Longer	Larger
Derived dimension	Significantly less	Shorter	Smaller

Reasons for High Expansion Rate

Many dimensions, no Cuboid pruning
High cardinality dimensions exist (like user ID, order ID)
Space-consuming measures exist (like Count Distinct)

Error Quick Reference

Symptom	Root Cause Location	Fix
Expansion rate > 1000%	Too many Cuboids/high cardinality dimensions/heavy Count Distinct	Check GUI size and expansion rate; use StatsReader to view size/row distribution; disable unhit/approximate Cuboids
Build extremely slow/fails	Dimension combination explosion/resources and splits mismatch	Check build Job logs; prune strongly based on Segment size and Mapper/Reducer count
StatsReader no output	Tool classpath and version mismatch	Execute in $KYLIN_HOME; verify class name in distribution package
Query becomes slow (after enabling Derived)	Query-side secondary aggregation cost increases	Check query plan/Profile; change high-frequency dimensions back to normal dimensions
Some Cuboids never hit	Design doesn’t match query patterns	Check query logs/hit statistics; disable that combination in Cube design