TL;DR
- Scenario: Business data grows daily, so the Cube needs incremental builds instead of a full recalculation
- Conclusion: An incremental build processes only the new time range, which cuts build time; Kylin automatically aggregates across Segments at query time
- Output: Complete incremental build flow, Segment management strategy, and a full vs incremental query comparison
Core Concepts of Incremental Build
Why Incremental Build Is Needed
In real business scenarios, data in Hive tables usually grows continuously:
- New partition data arrives daily
- Historical data is largely unchanged
- A full build recalculates all historical data, wasting resources
An incremental build processes only the data in the new time range, avoiding repeated computation of historical data.
What Is a Segment
Kylin divides a Cube into multiple Segments by time range:
- Each Segment corresponds to an HBase table
- Each Segment holds the pre-computed data for one time range [start, end), where the end time is exclusive
- At query time, Kylin automatically aggregates results from all Segments the query touches
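As a rough illustration of the query-time aggregation described above, the sketch below adds up per-Segment partial results. The Segment names and values are made up for the example, and `sum_partials` is a hypothetical helper; Kylin does this internally, and no Kylin API is called here.

```shell
# Hypothetical partial SUM(amount) per Segment, one "segment_name partial_sum" pair per line.
segment_partials='20240101000000_20240102000000 100
20240102000000_20240103000000 200
20240103000000_20240104000000 50'

# Reads "name value" lines on stdin and prints the total of the values.
sum_partials() {
  awk '{ total += $2 } END { print total + 0 }'
}

printf '%s\n' "$segment_partials" | sum_partials   # prints 350
```

The point is only that a SUM over several Segments equals the sum of each Segment's pre-computed SUM, which is why the query itself does not need to know how many Segments exist.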
Incremental Build Flow
1. Create a Partitioned Table
CREATE TABLE fact_sales (
  order_id STRING,
  region_id STRING,
  product_id STRING,
  amount DECIMAL(18,2),
  quantity INT
) PARTITIONED BY (dt STRING) -- date partition
STORED AS PARQUET;
-- Insert test data
INSERT INTO fact_sales PARTITION (dt='2024-01-01')
SELECT 'o001', 'r001', 'p001', 100.0, 1;
INSERT INTO fact_sales PARTITION (dt='2024-01-02')
SELECT 'o002', 'r002', 'p002', 200.0, 2;
2. Load Table in Kylin
# Refresh metadata, load new table
curl -X POST --user ADMIN:KYLIN \
http://h122.wzk.icu:7070/kylin/api/tables/refresh
3. Create Model (with Partition Field)
Create the Model in the Kylin Web UI:
- Fact table: fact_sales
- Partition field: dt (Partition Date Column)
- Dimensions: dt, region_id, product_id (dt must be a dimension for queries to filter or group by it)
- Measures: SUM(amount), SUM(quantity)
4. Create Cube
Create the Cube from the Model; its configuration is the same as for a regular Cube.
5. Build Segments Incrementally
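First Build (Full Build)
# First build: covers all data before 2024-01-01.
# startTime and endTime are epoch milliseconds (UTC assumed here); endTime is exclusive.
# 1704067200000 = 2024-01-01 00:00:00 UTC
curl -X PUT --user ADMIN:KYLIN \
  -H "Content-Type:application/json;charset=utf-8" \
  -d '{
    "startTime": 0,
    "endTime": 1704067200000,
    "buildType": "BUILD"
  }' \
  http://h122.wzk.icu:7070/kylin/api/cubes/ecommerce_cube/build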
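Kylin's REST documentation describes startTime and endTime as epoch-millisecond timestamps, so it helps to derive them from calendar dates rather than type them by hand. The sketch below is one way to do that; `to_epoch_ms` and `build_payload` are hypothetical helper names, GNU `date` is assumed for the `-d` option, and UTC midnight is assumed as the day boundary (adjust if your deployment uses a different time zone).

```shell
# Convert a calendar date (YYYY-MM-DD) to epoch milliseconds at UTC midnight.
# Assumes GNU date (the -d option is not POSIX).
to_epoch_ms() {
  echo "$(date -u -d "$1 00:00:00" +%s)000"
}

# Print the JSON body for a BUILD request over [start_date, end_date).
build_payload() {
  printf '{"startTime": %s, "endTime": %s, "buildType": "BUILD"}\n' \
    "$(to_epoch_ms "$1")" "$(to_epoch_ms "$2")"
}

build_payload 2024-01-01 2024-01-02
# prints {"startTime": 1704067200000, "endTime": 1704153600000, "buildType": "BUILD"}
```

The printed body can be passed to the build call with `-d "$(build_payload 2024-01-01 2024-01-02)"`.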
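Second Build (Incremental)
# Second build: Segment [2024-01-01, 2024-01-02), i.e. partition dt='2024-01-01'.
# It starts where the previous build ended (2024-01-01); endTime is exclusive.
# Times are epoch milliseconds (UTC assumed here).
curl -X PUT --user ADMIN:KYLIN \
  -H "Content-Type:application/json;charset=utf-8" \
  -d '{
    "startTime": 1704067200000,
    "endTime": 1704153600000,
    "buildType": "BUILD"
  }' \
  http://h122.wzk.icu:7070/kylin/api/cubes/ecommerce_cube/build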
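Third Build (Incremental)
# Third build: Segment [2024-01-02, 2024-01-03), i.e. partition dt='2024-01-02'.
# It starts where the previous build ended (2024-01-02); endTime is exclusive.
# Times are epoch milliseconds (UTC assumed here).
curl -X PUT --user ADMIN:KYLIN \
  -H "Content-Type:application/json;charset=utf-8" \
  -d '{
    "startTime": 1704153600000,
    "endTime": 1704240000000,
    "buildType": "BUILD"
  }' \
  http://h122.wzk.icu:7070/kylin/api/cubes/ecommerce_cube/build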
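Because each build starts where the previous one ended, the ranges for a run of daily builds can be generated arithmetically instead of being written out one by one. The following is a minimal sketch under that assumption; `DAY_MS` and `plan_daily_builds` are hypothetical names, and the timestamps are epoch milliseconds.

```shell
DAY_MS=86400000   # milliseconds in one day

# Print "start end" for N consecutive daily builds, beginning at start_ms.
# Each range begins at the previous range's end, so the Segments stay contiguous.
plan_daily_builds() {
  start_ms=$1
  count=$2
  i=0
  while [ "$i" -lt "$count" ]; do
    end_ms=$(( start_ms + DAY_MS ))
    echo "$start_ms $end_ms"
    start_ms=$end_ms
    i=$(( i + 1 ))
  done
}

plan_daily_builds 1704067200000 3
# prints:
# 1704067200000 1704153600000
# 1704153600000 1704240000000
# 1704240000000 1704326400000
```

Each printed pair can be used as the startTime and endTime of one build request. Deriving every start from the previous end avoids accidental gaps between Segments.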
Full vs Incremental Comparison
Build Differences
| Feature | Full Build | Incremental Build |
|---|---|---|
| Segment count | 1 | Multiple (increases with builds) |
| Build time | Long (full calculation each time) | Short (only new data) |
| Storage | Centralized | Distributed across Segments |
| Historical data | Recalculated each time | Remains unchanged |
Query Differences
Full Build Query:
- Only involves one Segment
- Returns result directly
Incremental Build Query:
- Involves multiple Segments
- Kylin aggregates the per-Segment results at runtime
- Example: a query over a date range covered by three Segments returns Segment1 + Segment2 + Segment3
-- The query syntax is the same after an incremental build
SELECT dt, SUM(amount)
FROM fact_sales
WHERE dt BETWEEN '2024-01-01' AND '2024-01-03'
GROUP BY dt;
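The same query can also be submitted through Kylin's REST query endpoint (`POST /kylin/api/query`), which takes the SQL text and a project name. The sketch below only builds the request body; `my_project` is a placeholder because the project name is not given in this article, and `query_payload` is a hypothetical helper that does no JSON escaping.

```shell
# Placeholder project name; replace with the project that owns the Cube.
KYLIN_PROJECT="my_project"

# Print the JSON body for the query endpoint.
# The SQL must not contain double quotes or backslashes, since nothing is escaped here.
query_payload() {
  printf '{"sql": "%s", "project": "%s"}\n' "$1" "$KYLIN_PROJECT"
}

SQL="SELECT dt, SUM(amount) FROM fact_sales WHERE dt BETWEEN '2024-01-01' AND '2024-01-03' GROUP BY dt"
query_payload "$SQL"

# To send it, using the same host and credentials as the build calls:
# curl -X POST --user ADMIN:KYLIN \
#   -H "Content-Type:application/json;charset=utf-8" \
#   -d "$(query_payload "$SQL")" \
#   http://h122.wzk.icu:7070/kylin/api/query
```

The request is identical whether the Cube has one Segment or many, which matches the point that the query syntax does not change.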
Segment Management
View Segments
On the Model page of the Kylin Web UI, you can see all Segments of a Cube, including:
- Segment name
- Time range (Start Time ~ End Time)
- Status (NEW/READY/ERROR)
- Size
Segment Merge
Multiple small Segments can be merged into one large Segment to reduce aggregation overhead at query time:
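# Merge the Segments in [2024-01-01, 2024-01-03) into one.
# Kylin's documented REST API submits merges through the build endpoint with buildType MERGE.
# Times are epoch milliseconds (UTC assumed here).
curl -X PUT --user ADMIN:KYLIN \
  -H "Content-Type:application/json;charset=utf-8" \
  -d '{
    "startTime": 1704067200000,
    "endTime": 1704240000000,
    "buildType": "MERGE"
  }' \
  http://h122.wzk.icu:7070/kylin/api/cubes/ecommerce_cube/build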
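A merge request needs a time range that covers the Segments you want combined. As a minimal sketch, assuming you have the start and end of each of those Segments as epoch-millisecond pairs (a hypothetical input format, for example copied from the Segment list), the covering range can be derived like this; `merge_range` is a hypothetical helper.

```shell
# Reads "start end" pairs, one Segment per line, sorts them by start time,
# and prints the first start and the last end. For contiguous Segments
# this is the range that covers all of them.
merge_range() {
  sort -n | awk '
    NR == 1 { start = $1 }
    { end = $2 }
    END { if (NR > 0) printf "%s %s\n", start, end }
  '
}

printf '%s\n' \
  '1704153600000 1704240000000' \
  '1704067200000 1704153600000' | merge_range
# prints 1704067200000 1704240000000
```

The two printed values are the start and end of the merge range.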
Auto Merge Strategy
Configure in the Refresh Settings step of the Cube Designer:
- Auto Merge Thresholds: 7 days, 28 days
- Kylin checks these thresholds and triggers a merge automatically whenever a new Segment becomes READY
Error Quick Reference
| Symptom | Root Cause Location | Fix |
|---|---|---|
| Incremental build fails | Partition data doesn’t exist | Confirm Hive partition created |
| Query result incomplete or inaccurate | Segment time ranges have gaps or overlap | Check that the build time ranges are continuous |
| Query becomes slow | Too many Segments | Configure auto merge, reduce Segment count |
| Build fails with a permission error | Kerberos/permission issue | Check Hive/HBase access permissions |
| Segment status ERROR | Build process exception | Check Job log to locate issue |
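For the time-range row in the table above, a small script can confirm whether a set of Segment ranges is continuous. This is a sketch under the assumption that the ranges are available as "start end" pairs in epoch milliseconds (for example, copied from the Segment list); `check_segments` is a hypothetical helper, not a Kylin tool.

```shell
# Reads "start end" pairs (one Segment per line, any order), sorts them by
# start time, and reports a GAP or OVERLAP between neighbouring Segments.
# Prints OK when every Segment starts exactly where the previous one ends.
check_segments() {
  sort -n | awk '
    NR > 1 && $1 + 0 > prev_end + 0 { printf "GAP %s %s\n", prev_end, $1; bad = 1 }
    NR > 1 && $1 + 0 < prev_end + 0 { printf "OVERLAP %s %s\n", $1, prev_end; bad = 1 }
    { prev_end = $2 }
    END { if (!bad) print "OK" }
  '
}

printf '%s\n' \
  '1704067200000 1704153600000' \
  '1704153600000 1704240000000' | check_segments   # prints OK
```

A `GAP` line shows the end of one Segment and the start of the next, which is the range that still needs a build.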