TL;DR

  • Scenario: Cube has many dimensions, Cuboid exponential growth causes slow build and storage expansion
  • Conclusion: Through derived dimensions + aggregation groups + expansion rate monitoring, can effectively control Cuboid count
  • Output: Cuboid inspection tool, derived dimension config, expansion rate analysis, error quick reference

Cuboid Pruning Overview

What is Cuboid

Cuboid specifically refers to all data computed under one dimension combination in Kylin. Optimization aimed at reducing Cuboid count is collectively called Cuboid pruning.

Cuboid Count Calculation

4 dimensions: 2^4 = 16 Cuboids
10 dimensions: 2^10 = 1024 Cuboids
20 dimensions: 2^20 = 1048576 Cuboids

Pruning Principles

  • Skip Cuboids that will never be queried
  • Skip Cuboids with capabilities close to other Cuboids
  • Prerequisite: Don’t significantly affect query time

Check Cuboid Count

CubeStatsReader Tool

# Enter Kylin installation directory
cd $KYLIN_HOME

# Execute Cuboid statistics
kylin.sh org.apache.kylin.engine.mr.common.CubeStatsReader wzk_kylin_test_cube_4

Output Results

  • Estimated Cuboid size precision (HII Precision)
  • Total Cuboid count
  • Estimated total rows in Segment
  • Estimated Segment size
  • All Cuboids and analysis results

Check Cube Size (Web GUI)

Select a Cube in READY status on Model page:

  • Cube Size: Actual Cube size
  • Expansion Rate: Expansion rate = Cube size / source data size

Expansion Rate Reference:

  • 0% ~ 1000%: Normal range
  • > 1000%: Needs optimization

Derived Dimension

Principle

Set non-primary key fields of dimension table as derived dimensions:

  • Don’t participate in Cuboid computation
  • Use dimension table primary key instead
  • Dynamically translate at query time

Applicable Scenarios

  • Dimension table fields have clear mapping with primary key
  • Not much aggregation work

Not Applicable Scenarios

  • Primary key to dimension requires heavy aggregation
  • Affects query performance

Create Derived Dimension

In Kylin Cube Designer:

  1. Select dimension table
  2. Find column to set as derived
  3. Check Derived
  4. Save Cube

Case Comparison

Date dimension table dim_date:

FieldDescription
dateidPrimary key
dayofyearDay of year
dayofmonthDay of month
day_in_yearDay of year (alternative)
day_in_monthDay of month (alternative)
weekdayDay of week
week_in_monthWeek of month
week_in_yearWeek of year
date_typeDate type
quarterQuarter

Note: Date dimension fields cannot be year, month, etc., need to be fields in dim_date table.

Comparison Results

Cube TypeCuboid CountBuild TimeStorage Size
Normal dimensionMoreLongerLarger
Derived dimensionSignificantly lessShorterSmaller

Reasons for High Expansion Rate

  1. Many dimensions, no Cuboid pruning
  2. High cardinality dimensions exist (like user ID, order ID)
  3. Space-consuming measures exist (like Count Distinct)

Error Quick Reference

SymptomRoot Cause LocationFix
Expansion rate > 1000%Too many Cuboids/high cardinality dimensions/heavy Count DistinctCheck GUI size and expansion rate; use StatsReader to view size/row distribution; disable unhit/approximate Cuboids
Build extremely slow/failsDimension combination explosion/resources and splits mismatchCheck build Job logs; prune strongly based on Segment size and Mapper/Reducer count
StatsReader no outputTool classpath and version mismatchExecute in $KYLIN_HOME; verify class name in distribution package
Query becomes slow (after enabling Derived)Query-side secondary aggregation cost increasesCheck query plan/Profile; change high-frequency dimensions back to normal dimensions
Some Cuboids never hitDesign doesn’t match query patternsCheck query logs/hit statistics; disable that combination in Cube design