TL;DR
- Scenario: Compare size and precision differences between Cube4 (optimized) and Cube7 (unoptimized)
- Conclusion: Cube7 is significantly larger than Cube4; the more dimensions a cube has, the more pronounced the exponential growth in Cuboids
- Output: Aggregation group config, RowKey design, encoding selection, sharding strategy
Cube7 vs Cube4 Comparison
Experiment Design
- Cube4: Optimized with an aggregation group plus mandatory, hierarchy, and joint dimensions
- Cube7: All dimensions set to Normal, no optimization
- Data volume: Same for both cubes
Results
| Cube | Cuboid Count | Size | Expansion Rate |
|---|---|---|---|
| Cube4 (optimized) | Fewer | Smaller | < 500% |
| Cube7 (unoptimized) | Exponential growth | Larger | > 1000% |
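The expansion rate in the table is usually read as cube size divided by source data size. A minimal sketch of that ratio follows; the function name and the byte counts are hypothetical and are not measurements from this experiment.

```python
def expansion_rate(cube_bytes: int, source_bytes: int) -> float:
    """Return cube size as a percentage of source data size."""
    if source_bytes <= 0:
        raise ValueError("source_bytes must be positive")
    return cube_bytes / source_bytes * 100.0

# Hypothetical sizes, for illustration only.
print(expansion_rate(4_000_000, 1_000_000))   # 400.0
print(expansion_rate(12_000_000, 1_000_000))  # 1200.0
```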
Aggregation Group
Purpose
The number of Cuboids grows exponentially with the dimension count: n dimensions yield up to 2^n Cuboids. Aggregation groups limit the pre-computation scale by restricting which dimension combinations are built.
Default Behavior
By default, all dimensions are in the same aggregation group.
Suggestion
Split into multiple aggregation groups when the cube has more than 15 dimensions.
Config Example
Group 1: dt, region_id, product_id
Group 2: dt, channel_id, category_id
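To see why splitting helps, the sketch below counts dimension combinations for the two groups above versus a single group holding all five dimensions. It is a simplified model, not Kylin's planner: it enumerates plain subsets and ignores details such as the base cuboid.

```python
from itertools import combinations

def cuboids(dims):
    """All non-empty dimension subsets (cuboids) of one aggregation group."""
    dims = sorted(dims)
    return {frozenset(c) for r in range(1, len(dims) + 1)
            for c in combinations(dims, r)}

all_dims = ["dt", "region_id", "product_id", "channel_id", "category_id"]
group1 = ["dt", "region_id", "product_id"]
group2 = ["dt", "channel_id", "category_id"]

single_group = cuboids(all_dims)
# Cuboids shared by both groups (here only {dt}) are counted once.
split_groups = cuboids(group1) | cuboids(group2)

print(len(single_group))  # 31
print(len(split_groups))  # 13
```

The trade-off is that a query combining dimensions from different groups, such as region_id with channel_id, no longer has a matching pre-computed combination in this model.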
Mandatory Dimension
Purpose
Dimensions that always appear in WHERE or GROUP BY. Once a dimension is marked mandatory, Cuboids that do not contain it are no longer built, which halves the number of combinations for each mandatory dimension.
Config
Check Mandatory in Kylin Cube Designer.
Examples
- dt (date): Almost all queries filter by date
- status: Always used as a filter condition
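The pruning effect of a mandatory dimension can be sketched by filtering the full set of combinations. The dimension list here is hypothetical and the model is a plain subset enumeration, not Kylin's own code.

```python
from itertools import combinations

def subsets(dims):
    """Every subset of dims, including the empty set."""
    return [set(c) for r in range(len(dims) + 1) for c in combinations(dims, r)]

dims = ["dt", "status", "region_id", "product_id"]  # hypothetical cube dimensions
mandatory = {"dt"}

all_cuboids = subsets(dims)
# Keep only the combinations that contain every mandatory dimension.
kept = [c for c in all_cuboids if mandatory <= c]

print(len(all_cuboids))  # 16
print(len(kept))         # 8
```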
Hierarchy Dimension
Purpose
Dimensions with hierarchical relationships (like country→province→city). A finer level is only pre-computed together with all of its coarser ancestors, so combinations that skip a level are pruned.
Config
Set hierarchy dimension in Cube Designer:
- dim_region.country
- dim_region.province
- dim_region.city
Effect
For {country, province, city} hierarchy dimension:
- Pre-compute: country, country+province, country+province+city
- Don’t pre-compute: province, city, province+city, country+city (a level appears without its ancestors)
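As a cross-check, the sketch below enumerates the combinations that survive under the hierarchy rule as described in Kylin's cube optimization documentation, where a finer level is kept only together with all of its coarser ancestors. The helper name is hypothetical and the code is a model of the rule, not Kylin's implementation.

```python
from itertools import combinations

hierarchy = ["country", "province", "city"]  # ordered coarse to fine

def respects_hierarchy(cuboid, levels):
    """True if every level present in the cuboid comes with all of its ancestors."""
    present = [lvl in cuboid for lvl in levels]
    # Valid patterns are a prefix of the hierarchy: True...True False...False.
    return all(a or not b for a, b in zip(present, present[1:]))

candidates = [set(c) for r in range(1, len(hierarchy) + 1)
              for c in combinations(hierarchy, r)]
kept = [sorted(c, key=hierarchy.index) for c in candidates
        if respects_hierarchy(c, hierarchy)]

print(kept)
# [['country'], ['country', 'province'], ['country', 'province', 'city']]
```

Three of the seven non-empty combinations remain under this rule.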
Joint Dimension
Purpose
Treats multiple dimensions as one: the joint dimensions either all appear in a Cuboid or none of them do.
Applicable Scenarios
- Dimensions always queried together
- Low cardinality dimension combinations
Config
Check Joint in Cube Designer.
Examples
- {province, city}: Always used together
- {channel_type, channel_name}: Low cardinality combination
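The all-or-none rule for a joint dimension can be sketched the same way. The dimension list is hypothetical and the enumeration is a simplified model rather than Kylin's planner.

```python
from itertools import combinations

def subsets(items):
    """Every subset of items, including the empty set."""
    return [set(c) for r in range(len(items) + 1) for c in combinations(items, r)]

dims = ["dt", "province", "city", "channel_type"]  # hypothetical cube dimensions
joint = {"province", "city"}

all_cuboids = subsets(dims)
# A cuboid survives only if it holds all joint members or none of them.
kept = [c for c in all_cuboids if joint <= c or not (joint & c)]

print(len(all_cuboids))  # 16
print(len(kept))         # 8
```

A query on province alone would then be answered from a combination that also contains city, which is why joint dimensions suit columns that are queried together.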
RowKey Design
Mapping Relationship
- Cuboid dimensions → HBase Rowkey
- Measures → HBase Value
Encoding Methods
| Encoding | Applicable Scenario |
|---|---|
| Dictionary | Low-cardinality enumerated values; values are automatically mapped to bytes |
| boolean | true/false |
| date | Date type |
| time | Time type |
| fixed_length | Fixed-length string |
RowKey Order Suggestions
- Mandatory dimensions first
- High-frequency filter dimensions next (those that frequently appear in WHERE conditions)
- High-cardinality dimensions after that
- Low-cardinality dimensions last
Example
RowKey order: dt (Mandatory) > region_id (high-frequency) > product_id (high cardinality) > category_id (low cardinality)
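The ordering suggestions can be expressed as a sort key. In this sketch the `Dim` class, the filter frequencies, and the cardinalities are all hypothetical; it also simplifies the tiers into one key in which cardinality only breaks ties between equally frequent filters.

```python
from dataclasses import dataclass

@dataclass
class Dim:
    name: str
    mandatory: bool
    filter_freq: float  # hypothetical share of queries filtering on this column
    cardinality: int    # hypothetical distinct-value count

dims = [
    Dim("category_id", False, 0.10, 50),
    Dim("product_id", False, 0.20, 100_000),
    Dim("dt", True, 0.95, 365),
    Dim("region_id", False, 0.70, 30),
]

# Sort key: mandatory first, then more frequently filtered, then higher cardinality.
ordered = sorted(dims, key=lambda d: (not d.mandatory, -d.filter_freq, -d.cardinality))
print(" > ".join(d.name for d in ordered))
# dt > region_id > product_id > category_id
```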
Sharding (ShardBy)
Purpose
Select a high-cardinality column as the sharding column so that data is distributed evenly across shards, which improves parallelism.
Config
Set ShardBy field in Cube Designer.
Examples
- product_id: High cardinality, distributes well
- order_id: Unique ID, so rows spread close to evenly across shards
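The effect of cardinality on distribution can be illustrated with a toy hash-partitioning model. This is an assumption-laden sketch: it uses CRC32 modulo a hypothetical shard count, which is not necessarily how Kylin assigns shards, and the column values are synthetic.

```python
import zlib
from collections import Counter

def shard_counts(values, num_shards):
    """Count rows per shard when shard = crc32(value) mod num_shards."""
    return Counter(zlib.crc32(str(v).encode()) % num_shards for v in values)

num_shards = 8  # hypothetical shard count
rows = 10_000
high_card = [f"product_{i}" for i in range(rows)]    # 10,000 distinct values
low_card = [f"status_{i % 3}" for i in range(rows)]  # only 3 distinct values

for name, col in [("high cardinality", high_card), ("low cardinality", low_card)]:
    counts = shard_counts(col, num_shards)
    print(name, "shards used:", len(counts), "largest shard:", max(counts.values()))
```

With only three distinct values, the low-cardinality column can occupy at most three shards, so most shards stay empty regardless of the hash function.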
Precision/Sparsity Analysis
CubeStatsReader Output
kylin.sh org.apache.kylin.engine.mr.common.CubeStatsReader cube_name
Output Metrics
- HLL Precision: Precision of the HyperLogLog counters used to estimate row counts
- Total Cuboids: Total Cuboid count
- Total Rows: Total row count
- Total Size: Total size
- Per Cuboid details: Each Cuboid’s row count/size
Error Quick Reference
| Symptom | Root Cause Location | Fix |
|---|---|---|
| Slow build | Too many Cuboids | Configure aggregation group, mandatory/hierarchy/joint dimensions |
| Slow query | Improper RowKey order | Adjust encoding and order |
| Large storage | Dictionary encoding not used, or uneven sharding | Optimize encoding, select appropriate sharding column |
| Low hit rate | Query pattern doesn’t match Cube design | Analyze query logs, adjust dimension config |
| OOM | Dimension combination explosion | Split aggregation groups, reduce dimensions |