TL;DR

  • Scenario: Compare size and precision differences between Cube4 (optimized) and Cube7 (unoptimized)
  • Conclusion: Cube7 is significantly larger than Cube4; the more dimensions a cube has, the more pronounced the exponential growth in Cuboids
  • Output: Aggregation group config, RowKey design, encoding selection, sharding strategy

Cube7 vs Cube4 Comparison

Experiment Design

  • Cube4: Optimized with aggregation groups plus mandatory, hierarchy, and joint dimensions
  • Cube7: All dimensions set to Normal, no optimization
  • Data volume: Same for both cubes

Results

| Cube | Cuboid Count | Size | Expansion Rate |
|---|---|---|---|
| Cube4 (optimized) | Fewer | Smaller | < 500% |
| Cube7 (unoptimized) | Exponential growth | Larger | > 1000% |
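
The expansion rate compares cube size with source data size. A minimal sketch of that calculation, assuming the usual definition (cube size divided by source size, as a percentage). The byte counts are hypothetical values chosen for illustration, not measurements from this experiment.

```python
def expansion_rate(cube_bytes: int, source_bytes: int) -> float:
    """Return cube size as a percentage of the source data size."""
    if source_bytes <= 0:
        raise ValueError("source_bytes must be positive")
    return cube_bytes / source_bytes * 100


# Hypothetical sizes, for illustration only.
source = 2 * 1024**3        # 2 GiB of source data
optimized = 6 * 1024**3     # 6 GiB cube
unoptimized = 24 * 1024**3  # 24 GiB cube

print(f"optimized:   {expansion_rate(optimized, source):.0f}%")
print(f"unoptimized: {expansion_rate(unoptimized, source):.0f}%")
```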

Aggregation Group

Purpose

Cuboid count grows exponentially with dimension count: N dimensions yield up to 2^N Cuboids. Aggregation groups limit the scale of pre-computation by restricting which dimension combinations are built.

Default Behavior

By default, all dimensions are in the same aggregation group.

Suggestion

Split the dimensions into multiple aggregation groups when the cube has more than 15 dimensions.

Config Example

Group 1: dt, region_id, product_id
Group 2: dt, channel_id, category_id
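
To see why the two groups above shrink the cube, the following sketch counts dimension combinations with and without grouping. It is a simplified model, not Kylin's scheduler: it counts only non-empty combinations and ignores any extra Cuboids the engine may add (such as the base Cuboid). The helper names are made up for this example.

```python
from itertools import combinations


def nonempty_subsets(dims):
    """Return every non-empty combination of dims as a frozenset."""
    return {
        frozenset(combo)
        for size in range(1, len(dims) + 1)
        for combo in combinations(dims, size)
    }


def cuboids_for_groups(groups):
    """Union of the combinations produced by each aggregation group."""
    result = set()
    for group in groups:
        result |= nonempty_subsets(group)
    return result


group_1 = ["dt", "region_id", "product_id"]
group_2 = ["dt", "channel_id", "category_id"]
all_dims = sorted(set(group_1) | set(group_2))

full = nonempty_subsets(all_dims)                # one group with all 5 dimensions
grouped = cuboids_for_groups([group_1, group_2])

print(len(full), len(grouped))  # 31 13
```

The single combination {dt} is shared by both groups, so it is counted once.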

Mandatory Dimension

Purpose

Dimensions that always appear in WHERE or GROUP BY. Once a dimension is marked mandatory, Cuboids that do not contain it are no longer built, which roughly halves the Cuboid count for each mandatory dimension.

Config

Check Mandatory in Kylin Cube Designer.

Examples

  • dt (date): Almost all queries filter by date
  • status: Always used as a filter condition
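
A minimal sketch of the pruning effect, assuming a simplified model in which every kept combination must contain all mandatory dimensions. The dimension list and the function name are illustrative only.

```python
from itertools import combinations


def cuboids_with_mandatory(dims, mandatory):
    """Enumerate combinations that contain every mandatory dimension."""
    optional = [d for d in dims if d not in mandatory]
    cuboids = []
    for size in range(len(optional) + 1):
        for combo in combinations(optional, size):
            cuboids.append(tuple(mandatory) + combo)
    return cuboids


dims = ["dt", "status", "region_id", "product_id"]

one_mandatory = cuboids_with_mandatory(dims, ["dt"])
two_mandatory = cuboids_with_mandatory(dims, ["dt", "status"])

# 4 free dimensions give 2**4 = 16 combinations (including the empty one);
# each mandatory dimension halves that number.
print(len(one_mandatory), len(two_mandatory))  # 8 4
```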

Hierarchy Dimension

Purpose

For dimensions with a hierarchical relationship (such as country→province→city), only combinations that follow the hierarchy are pre-computed: a lower level never appears without its parent levels.

Config

Set hierarchy dimension in Cube Designer:

  • dim_region.country
  • dim_region.province
  • dim_region.city

Effect

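For the {country, province, city} hierarchy:

  • Pre-compute: country, country+province, country+province+city (each level appears only together with its ancestors)
  • Don’t pre-compute: province, city, province+city, country+city (a lower level without all of its parent levels)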
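
A minimal sketch of hierarchy pruning, assuming the prefix rule Kylin documents for hierarchy dimensions: a level is kept only together with all of its ancestor levels. The function name is made up for this example.

```python
from itertools import combinations


def hierarchy_cuboids(levels):
    """Return the kept combinations for one hierarchy: every prefix."""
    return [tuple(levels[:depth]) for depth in range(1, len(levels) + 1)]


levels = ["country", "province", "city"]

kept = hierarchy_cuboids(levels)
all_combos = [
    combo
    for size in range(1, len(levels) + 1)
    for combo in combinations(levels, size)
]
pruned = [combo for combo in all_combos if combo not in kept]

print("kept:  ", kept)
print("pruned:", pruned)
```

Under this rule a three-level hierarchy contributes 3 combinations instead of 7.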

Joint Dimension

Purpose

Treats multiple dimensions as one: they either all appear in a Cuboid together or none of them appear.

Applicable Scenarios

  • Dimensions always queried together
  • Low cardinality dimension combinations

Config

Check Joint in Cube Designer.

Examples

  • {province, city}: Always used together
  • {channel_type, channel_name}: Low cardinality combination
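
The effect of joint dimensions can be sketched by treating each joint group as a single unit when enumerating combinations. This is a simplified model with a made-up function name, not Kylin's implementation, and it counts only non-empty combinations.

```python
from itertools import combinations


def cuboids_with_joint(dims, joint_groups):
    """Enumerate non-empty combinations where each joint group is one unit."""
    joined = {d for group in joint_groups for d in group}
    units = [(d,) for d in dims if d not in joined]
    units += [tuple(group) for group in joint_groups]
    cuboids = []
    for size in range(1, len(units) + 1):
        for combo in combinations(units, size):
            # Flatten the chosen units back into a flat tuple of dimensions.
            cuboids.append(tuple(d for unit in combo for d in unit))
    return cuboids


dims = ["dt", "province", "city", "channel_type", "channel_name"]
joint = [["province", "city"], ["channel_type", "channel_name"]]

cuboids = cuboids_with_joint(dims, joint)
print(len(cuboids))  # 7, versus 31 without joint groups
```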

RowKey Design

Mapping Relationship

  • Cuboid dimensions → HBase Rowkey
  • Measures → HBase Value

Encoding Methods

| Encoding | Applicable Scenario |
|---|---|
| Dictionary | Low-cardinality enumerated values; automatically mapped to bytes |
| boolean | true/false values |
| date | Date type |
| time | Time type |
| fixed_length | Fixed-length strings |
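
To illustrate why dictionary encoding suits low-cardinality columns, here is a minimal sketch that maps each distinct value to a small integer ID and stores that ID in as few bytes as possible. It shows the general idea only; Kylin's actual dictionary structure is more elaborate, and the sample values are hypothetical.

```python
def build_dictionary(values):
    """Assign each distinct value an integer ID, in sorted order."""
    return {value: idx for idx, value in enumerate(sorted(set(values)))}


def id_width_bytes(dictionary):
    """Number of bytes needed to store the largest ID."""
    max_id = max(len(dictionary) - 1, 1)
    return (max_id.bit_length() + 7) // 8


# Hypothetical low-cardinality column.
channels = ["web", "app", "store", "web", "app", "web"]

dictionary = build_dictionary(channels)
width = id_width_bytes(dictionary)
encoded = [dictionary[v].to_bytes(width, "big") for v in channels]

print(dictionary)  # {'app': 0, 'store': 1, 'web': 2}
print(width)       # 1
```

Each value now occupies one byte in the RowKey regardless of the length of the original string.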

RowKey Order Suggestions

  1. Mandatory - Mandatory dimensions first
  2. High-frequency filter - Dimensions that frequently appear in WHERE conditions
  3. High cardinality - High-cardinality dimensions next
  4. Low cardinality - Low-cardinality dimensions last

Example

RowKey order: dt (Mandatory) > region_id (high-frequency) > product_id (high cardinality) > category_id (low cardinality)
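
The ordering rules above can be sketched as a sort key. The `Dimension` class, the `filtered_often` flag, and the cardinality figures are hypothetical inputs for illustration; in practice they would come from query logs and column statistics.

```python
from dataclasses import dataclass


@dataclass
class Dimension:
    name: str
    mandatory: bool
    filtered_often: bool
    cardinality: int


def rowkey_order(dimensions):
    """Sort: mandatory first, then frequent filters, then by descending cardinality."""
    return sorted(
        dimensions,
        key=lambda d: (not d.mandatory, not d.filtered_often, -d.cardinality),
    )


# Hypothetical statistics, for illustration only.
dims = [
    Dimension("category_id", mandatory=False, filtered_often=False, cardinality=50),
    Dimension("product_id", mandatory=False, filtered_often=False, cardinality=1_000_000),
    Dimension("region_id", mandatory=False, filtered_often=True, cardinality=300),
    Dimension("dt", mandatory=True, filtered_often=True, cardinality=365),
]

print(" > ".join(d.name for d in rowkey_order(dims)))
# dt > region_id > product_id > category_id
```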

Sharding (ShardBy)

Purpose

Select a high-cardinality column as the sharding column so that data is distributed evenly across shards, which improves parallelism.

Config

Set ShardBy field in Cube Designer.

Examples

  • product_id: High cardinality, good distribution effect
  • order_id: Unique ID, perfect sharding
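
A small simulation shows why cardinality matters for sharding. It assumes rows are assigned to shards by hashing the ShardBy value modulo the shard count. CRC32 stands in for whatever hash the engine actually uses, and the shard count and sample data are made up.

```python
import zlib
from collections import Counter


def shard_of(value, shard_count):
    """Assign a value to a shard via a stable hash (CRC32 as a stand-in)."""
    return zlib.crc32(str(value).encode("utf-8")) % shard_count


def shard_histogram(values, shard_count):
    """Count how many rows land on each shard."""
    return Counter(shard_of(v, shard_count) for v in values)


SHARDS = 8
order_ids = [f"order-{i}" for i in range(10_000)]     # unique per row
statuses = ["paid", "pending", "cancelled"] * 3_334   # only 3 distinct values

by_order_id = shard_histogram(order_ids, SHARDS)
by_status = shard_histogram(statuses, SHARDS)

print("shards used by order_id:", len(by_order_id))
print("shards used by status:  ", len(by_status))
```

A column with three distinct values can occupy at most three shards, no matter how many shards exist, while the unique ID spreads rows over all of them.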

Precision/Sparsity Analysis

CubeStatsReader Output

kylin.sh org.apache.kylin.engine.mr.common.CubeStatsReader cube_name

Output Metrics

  • HLL Precision: Precision of the HyperLogLog estimator used for the statistics
  • Total Cuboids: Total Cuboid count
  • Total Rows: Total row count
  • Total Size: Total size
  • Per Cuboid details: Each Cuboid’s row count/size

Error Quick Reference

| Symptom | Root Cause Location | Fix |
|---|---|---|
| Slow build | Too many Cuboids | Configure aggregation groups and mandatory/hierarchy/joint dimensions |
| Slow query | Improper RowKey order | Adjust encoding and order |
| Large storage | Dictionary encoding not used / uneven sharding | Optimize encoding, select an appropriate sharding column |
| Low hit rate | Query pattern doesn’t match Cube design | Analyze query logs, adjust dimension config |
| OOM | Dimension combination explosion | Split aggregation groups, reduce dimensions |