TL;DR

  • Scenario: Business data arrives daily, so the Cube should be built incrementally instead of being fully recalculated
  • Conclusion: Incremental builds significantly reduce build time, and Kylin automatically aggregates across Segments at query time
  • Output: A complete incremental build flow, a Segment management strategy, and a comparison of query behavior

Core Concepts of Incremental Build

Why Incremental Build Is Needed

In real business scenarios, data in Hive tables usually grows continuously:

  • A new partition arrives every day
  • Historical data rarely changes
  • A full build recalculates all historical data every time, which wastes resources

An incremental build processes only the data in the new time range, so historical data is never recomputed.
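To make the saving concrete, here is a back-of-the-envelope sketch in Python. It assumes every daily partition is roughly the same size and that build cost is proportional to the number of daily partitions scanned. Both are simplifications, and the function name is purely illustrative.

```python
def partitions_scanned(days: int, incremental: bool) -> int:
    """Total daily partitions processed after `days` consecutive daily builds."""
    if incremental:
        # Each build reads only the newest partition.
        return days
    # Full build number k re-reads partitions 1..k, so the total is 1 + 2 + ... + days.
    return days * (days + 1) // 2


full = partitions_scanned(30, incremental=False)
incr = partitions_scanned(30, incremental=True)
print(full, incr)  # 465 30
```

Under these assumptions, a month of daily full builds scans 465 partition-days, while incremental builds scan 30.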

What Is a Segment

Kylin divides a Cube into multiple Segments by time range:

  • Each Segment corresponds to one HBase table (when HBase is the storage engine)
  • Each Segment holds the pre-computed data for a specific time period
  • At query time, Kylin automatically aggregates the results from multiple Segments
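As a rough mental model (not Kylin's actual storage or query code), each Segment can be pictured as a map of pre-aggregated totals keyed by dimension value, and a query combines those maps key by key. The sketch below assumes SUM(amount) grouped by region_id. The extra r001 row in the second Segment is hypothetical and is there only to show two partial sums being combined.

```python
from decimal import Decimal

# Hypothetical pre-aggregated SUM(amount) per region_id, one mapping per Segment.
segment_jan01 = {"r001": Decimal("100.00")}
segment_jan02 = {"r002": Decimal("200.00"), "r001": Decimal("50.00")}


def aggregate_across_segments(segments):
    """Sum the partial aggregates of every Segment, key by key."""
    totals = {}
    for partial in segments:
        for key, value in partial.items():
            totals[key] = totals.get(key, Decimal("0")) + value
    return totals


result = aggregate_across_segments([segment_jan01, segment_jan02])
print(result)
```

This works because SUM is additive: the sum over the whole range equals the sum of the per-Segment sums.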

Incremental Build Flow

1. Create Partition Table

CREATE TABLE fact_sales (
  order_id STRING,
  region_id STRING,
  product_id STRING,
  amount DECIMAL(18,2),
  quantity INT
) PARTITIONED BY (dt STRING)  -- date partition
STORED AS PARQUET;

-- Insert test data
INSERT INTO fact_sales PARTITION (dt='2024-01-01')
SELECT 'o001', 'r001', 'p001', 100.0, 1;

INSERT INTO fact_sales PARTITION (dt='2024-01-02')
SELECT 'o002', 'r002', 'p002', 200.0, 2;

2. Load Table in Kylin

# Refresh metadata, load new table
curl -X POST --user ADMIN:KYLIN \
  http://h122.wzk.icu:7070/kylin/api/tables/refresh

3. Create Model (with Partition Field)

Create the Model in the Kylin Web UI:

  • Fact table: fact_sales
  • Partition field: dt (Partition Date Column, date format yyyy-MM-dd)
  • Dimensions: dt, region_id, product_id (dt must be a dimension for queries to filter or group by it)
  • Measures: SUM(amount), SUM(quantity)

4. Create Cube

Create the Cube on top of this Model; the configuration is the same as for a regular Cube.

5. Build Segments Incrementally

Kylin's build API takes startTime and endTime as epoch milliseconds (GMT), not as formatted date strings. Each Segment covers the half-open range [startTime, endTime), so the end of one build is the start of the next.

First Build (Full Build)

# First build: startTime 0 covers all data before 2024-01-01 00:00:00 GMT
curl -X PUT --user ADMIN:KYLIN \
  -H "Content-Type:application/json;charset=utf-8" \
  -d '{
    "startTime": 0,
    "endTime": 1704067200000,
    "buildType": "BUILD"
  }' \
  http://h122.wzk.icu:7070/kylin/api/cubes/ecommerce_cube/build

Second Incremental Build

# Incremental build for [2024-01-01, 2024-01-02), i.e. partition dt='2024-01-01'
curl -X PUT --user ADMIN:KYLIN \
  -H "Content-Type:application/json;charset=utf-8" \
  -d '{
    "startTime": 1704067200000,
    "endTime": 1704153600000,
    "buildType": "BUILD"
  }' \
  http://h122.wzk.icu:7070/kylin/api/cubes/ecommerce_cube/build

Third Incremental Build

# Incremental build for [2024-01-02, 2024-01-03), i.e. partition dt='2024-01-02'
curl -X PUT --user ADMIN:KYLIN \
  -H "Content-Type:application/json;charset=utf-8" \
  -d '{
    "startTime": 1704153600000,
    "endTime": 1704240000000,
    "buildType": "BUILD"
  }' \
  http://h122.wzk.icu:7070/kylin/api/cubes/ecommerce_cube/build
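
Computing build timestamps by hand is error-prone. The Python sketch below derives the half-open one-day window for a given dt value and serializes the request body. It assumes startTime and endTime are epoch milliseconds interpreted in GMT, as described in Kylin's REST API documentation. The helper names day_window_ms and build_payload are made up for illustration, and the script only prints the JSON rather than calling the server.

```python
import json
from datetime import datetime, timedelta, timezone


def day_window_ms(dt):
    """Return (start, end) in epoch milliseconds (UTC) for a 'YYYY-MM-DD' day.

    The window is half-open: start is midnight of dt, end is midnight of the next day.
    """
    start = datetime.strptime(dt, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    end = start + timedelta(days=1)
    return int(start.timestamp() * 1000), int(end.timestamp() * 1000)


def build_payload(dt):
    """Serialize the JSON body for an incremental build of one daily partition."""
    start_ms, end_ms = day_window_ms(dt)
    return json.dumps({"startTime": start_ms, "endTime": end_ms, "buildType": "BUILD"})


print(build_payload("2024-01-01"))
```

The printed string can be passed to curl with -d, which keeps each day's window aligned with the previous one.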

Full vs Incremental Comparison

Build Differences

Feature         | Full Build                        | Incremental Build
Segment count   | 1                                 | Multiple (increases with builds)
Build time      | Long (full calculation each time) | Short (only new data)
Storage         | Centralized                       | Distributed across Segments
Historical data | Recalculated each time            | Remains unchanged

Query Differences

Full Build Query:

  • Only involves one Segment
  • Returns result directly

Incremental Build Query:

  • Involves multiple Segments
  • Kylin aggregates the per-Segment results at query time
  • Example: a sum over a date range that spans three Segments = Segment1 + Segment2 + Segment3

-- After an incremental build, the query syntax is unchanged
SELECT dt, SUM(amount)
FROM fact_sales
WHERE dt BETWEEN '2024-01-01' AND '2024-01-03'
GROUP BY dt;
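
Conceptually, the filter on the partition column decides which Segments contribute to a result: only Segments whose time range overlaps the queried range matter. The sketch below illustrates that overlap test in Python. It is a simplified illustration rather than Kylin's query planner, the three daily Segments are hypothetical, and ranges are written as ISO dates (which compare correctly as strings) for readability.

```python
def segments_for_query(segments, query_start, query_end):
    """Return the Segments whose [start, end) range overlaps [query_start, query_end)."""
    return [
        (start, end)
        for start, end in segments
        if start < query_end and query_start < end
    ]


# Hypothetical daily Segments, each a half-open [start, end) range.
segments = [
    ("2024-01-01", "2024-01-02"),
    ("2024-01-02", "2024-01-03"),
    ("2024-01-03", "2024-01-04"),
]

# dt BETWEEN '2024-01-01' AND '2024-01-02' is the half-open range ending at 2024-01-03.
hit = segments_for_query(segments, "2024-01-01", "2024-01-03")
print(hit)
```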

Segment Management

View Segments

On the Model page of the Kylin Web UI, you can see all Segments of a Cube:

  • Segment name
  • Time range (Start Time ~ End Time)
  • Status (NEW/READY/ERROR)
  • Size

Segment Merge

Multiple small Segments can be merged into one large Segment to reduce aggregation overhead at query time:

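# Merge the Segments covering [2024-01-01, 2024-01-03)
# Merging goes through the same build endpoint with buildType MERGE; times are epoch milliseconds (GMT)
# and must line up with existing Segment boundaries
curl -X PUT --user ADMIN:KYLIN \
  -H "Content-Type:application/json;charset=utf-8" \
  -d '{
    "startTime": 1704067200000,
    "endTime": 1704240000000,
    "buildType": "MERGE"
  }' \
  http://h122.wzk.icu:7070/kylin/api/cubes/ecommerce_cube/build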

Auto Merge Strategy

Configure this in the Refresh Settings step of the Cube Designer:

  • Auto Merge Thresholds: 7 days, 28 days
  • Kylin automatically checks for a merge whenever a new Segment becomes READY
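
The following Python sketch shows one way to think about threshold-based merging. It is a simplified model under stated assumptions, not Kylin's actual implementation: larger thresholds are tried first, and a run of contiguous Segments is merged once it fills the whole threshold window. The function name and the sample Segments are illustrative.

```python
DAY_MS = 24 * 60 * 60 * 1000


def find_auto_merge_range(segments, thresholds_days):
    """Pick the first run of contiguous Segments that fills a merge threshold.

    segments: list of (start_ms, end_ms) for READY Segments, sorted by start.
    thresholds_days: for example [7, 28]; larger thresholds are tried first.
    Returns the (start_ms, end_ms) range to merge, or None if nothing qualifies.
    """
    for days in sorted(thresholds_days, reverse=True):
        window = days * DAY_MS
        for i, (start, _) in enumerate(segments):
            end, count = start, 0
            for seg_start, seg_end in segments[i:]:
                # Stop at a gap, or when the run would exceed the window.
                if seg_start != end or seg_end > start + window:
                    break
                end, count = seg_end, count + 1
            # Merge only if several Segments together fill the whole window.
            if count > 1 and end - start >= window:
                return (start, end)
    return None


base = 1704067200000  # 2024-01-01 00:00:00 GMT
daily = [(base + k * DAY_MS, base + (k + 1) * DAY_MS) for k in range(7)]
print(find_auto_merge_range(daily, [7, 28]))      # (1704067200000, 1704672000000)
print(find_auto_merge_range(daily[:6], [7, 28]))  # None
```

With seven contiguous daily Segments the 7-day threshold is met and one weekly range is proposed; with only six, nothing is merged yet.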

Error Quick Reference

Symptom                             | Likely Root Cause                           | Fix
Incremental build fails             | Partition data doesn't exist                | Confirm the Hive partition has been created
Query result inaccurate             | Segment time ranges have gaps or overlap    | Check that build time ranges are continuous
Query becomes slow                  | Too many Segments                           | Configure auto merge to reduce the Segment count
Build fails with a permission error | Kerberos/permission issue                   | Check Hive/HBase access permissions
Segment status ERROR                | Exception during the build process          | Check the Job log to locate the issue
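
For the time-range row in particular, a quick check of the planned build windows can catch gaps and overlaps before any job is submitted. The Python sketch below is a standalone helper with an illustrative name; it does not call Kylin and assumes each window is a half-open (start_ms, end_ms) pair.

```python
def find_range_problems(ranges):
    """Report gaps and overlaps between half-open (start, end) build windows."""
    problems = []
    ordered = sorted(ranges)
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start < prev_end:
            problems.append(("overlap", next_start, prev_end))
        elif next_start > prev_end:
            problems.append(("gap", prev_end, next_start))
    return problems


planned = [
    (0, 1704067200000),
    (1704067200000, 1704153600000),
    (1704240000000, 1704326400000),  # 2024-01-02 was skipped
]
print(find_range_problems(planned))
```

An empty list means every window starts exactly where the previous one ends.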