TL;DR

  • Scenario: Real-time data stream (Kafka) connected to Kylin, achieving minute-level OLAP analysis
  • Conclusion: Streaming Cubing supports real-time aggregation queries with 3-5 minute latency
  • Output: Kafka config, message format, build command, scheduled execution

Streaming Cubing Overview

What is Streaming Cubing

Apache Kylin introduced the Streaming Cubing feature in v1.6 to address near-real-time data update requirements. It builds on the Hadoop ecosystem's processing capability to consume data from a Kafka message queue, keeping the Cube updated at minute-level latency (typically 3-5 minutes).

Core Features

  • Consume Kafka message queue
  • Minute-level data updates
  • Same query syntax as offline Cube
  • Supports Lambda architecture (realtime + batch)

Implementation Steps

1. Create Project

Create project in Kylin Web UI.

2. Define Data Source (Kafka)

Data Source → Add Streaming Table

3. Define Model

Model → Streaming Model

4. Define Cube

Cube → Streaming Cube

5. Build Cube

Build → Streaming Build

6. Job Scheduling

Use crontab to trigger incremental builds periodically.


Message Body JSON Structure

Requirements

{
  "dimensions": {
    "region": "APAC",
    "product_line": "smartphone"
  },
  "metrics": {
    "order_count": 42,
    "revenue": 12500.00
  },
  "timestamp": "2023-05-15T09:15:30Z"
}

Three-part JSON

  1. dimensions: Dimension fields
  2. metrics: Measure fields
  3. timestamp: Timestamp
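A producer can emit messages in this shape. Below is a minimal Python sketch; the topic name in the comment is a placeholder, and the actual Kafka producer call is only indicated, not implemented:

```python
import json
from datetime import datetime, timezone

def make_message(region, product_line, order_count, revenue):
    """Build a Kafka message body in the three-part JSON structure."""
    return json.dumps({
        "dimensions": {"region": region, "product_line": product_line},
        "metrics": {"order_count": order_count, "revenue": revenue},
        # ISO-8601 UTC timestamp; Kylin uses this as the event time
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    })

msg = make_message("APAC", "smartphone", 42, 12500.00)
# e.g. producer.send("kylin_streaming_topic", msg.encode("utf-8"))
```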

Time Partition Column

Place the time partition column (such as minute_start) at the head of the RowKey column order.


Build Commands

REST API Build

curl -X PUT --user ADMIN:KYLIN \
  -H "Content-Type:application/json;charset=utf-8" \
  -d '{
    "sourceOffsetStart": 0,
    "sourceOffsetEnd": 9223372036854775807,
    "buildType": "BUILD"
  }' \
  http://h122.wzk.icu:7070/kylin/api/cubes/streaming_cube1/build2

Parameter Description

  • sourceOffsetStart: Kafka start offset (0 continues from the previous segment's end position)
  • sourceOffsetEnd: Kafka end offset (9223372036854775807 is Long.MAX_VALUE, i.e. consume up to the latest available offset)
  • buildType: BUILD (incremental build)
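The request body can also be assembled programmatically. A hedged Python sketch, reusing the host and cube name from the example above; the actual HTTP PUT (with Basic auth) is left as a comment:

```python
import json

# Long.MAX_VALUE: tells Kylin to consume up to the latest available offset
LONG_MAX = 9223372036854775807

payload = {
    "sourceOffsetStart": 0,       # continue from the last built position
    "sourceOffsetEnd": LONG_MAX,  # end at the newest Kafka offset
    "buildType": "BUILD",         # incremental build
}
body = json.dumps(payload)
url = "http://h122.wzk.icu:7070/kylin/api/cubes/streaming_cube1/build2"
# e.g. send `body` to `url` as an HTTP PUT with Basic auth (ADMIN:KYLIN)
```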

Auto Build (crontab)

Trigger an incremental build every 20 minutes. A crontab entry must fit on a single line, so save the curl call as a script (the path below is an example) and schedule the script:

#!/bin/bash
# /opt/scripts/build_streaming_cube.sh (example path)
curl -X PUT --user ADMIN:KYLIN \
  -H "Content-Type:application/json;charset=utf-8" \
  -d '{
    "sourceOffsetStart": 0,
    "sourceOffsetEnd": 9223372036854775807,
    "buildType": "BUILD"
  }' \
  http://h122.wzk.icu:7070/kylin/api/cubes/streaming_cube1/build2

Crontab entry:

*/20 * * * * /opt/scripts/build_streaming_cube.sh

Micro-batch Build & Refresh Merge Window

Window Type    Time Interval
0.5h           30 minutes
4h             4 hours
1d             1 day
7d             7 days
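Kylin's cube descriptor expresses merge thresholds in milliseconds (the auto_merge_time_ranges field). A quick Python check of the conversions for the four windows above:

```python
HOUR_MS = 60 * 60 * 1000
DAY_MS = 24 * HOUR_MS

# Merge windows from the table, converted to the millisecond values
# a cube descriptor's auto_merge_time_ranges list would carry.
windows = {
    "0.5h": int(0.5 * HOUR_MS),
    "4h": 4 * HOUR_MS,
    "1d": DAY_MS,
    "7d": 7 * DAY_MS,
}
```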

Typical Application Scenarios

E-commerce Transaction Analysis

  • Real-time sales statistics
  • Order volume monitoring
  • Regional sales ranking

User Behavior Analysis

  • Real-time UV/PV
  • Click stream analysis
  • User profile real-time updates

IoT Device Monitoring

  • Sensor data aggregation
  • Anomaly alerts
  • Device status real-time statistics

Parser Config

TimedJsonStreamParser

Use TimedJsonStreamParser to parse three-part JSON:

  • Automatically extract dimensions
  • Automatically extract metrics
  • Automatically extract timestamp
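TimedJsonStreamParser maps nested JSON properties to flat column names; the underscore separator used here is an assumption for illustration, but the flattening effect can be sketched as:

```python
def flatten(obj, prefix=""):
    """Recursively flatten nested JSON into single-level column names."""
    flat = {}
    for key, value in obj.items():
        # join nested keys with "_" (illustrative separator convention)
        name = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat

row = flatten({
    "dimensions": {"region": "APAC", "product_line": "smartphone"},
    "metrics": {"order_count": 42},
    "timestamp": "2023-05-15T09:15:30Z",
})
```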

Time Partition

  • minute_start: Minute-level partition
  • hour_start: Hour-level partition
  • day_start: Day-level partition
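Each partition column is a truncation of the event timestamp. A minimal Python sketch of how minute_start, hour_start, and day_start relate to a message's timestamp field:

```python
from datetime import datetime

def partition_columns(ts: str):
    """Derive the three time partition values from an ISO-8601 timestamp."""
    t = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")
    return {
        "minute_start": t.replace(second=0),               # floor to minute
        "hour_start": t.replace(minute=0, second=0),       # floor to hour
        "day_start": t.replace(hour=0, minute=0, second=0) # floor to day
    }

cols = partition_columns("2023-05-15T09:15:30Z")
```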

Error Quick Reference

Symptom                  Root Cause                                 Fix
Kafka consumption lag    Network, partitions, consumer capacity     Check consumer concurrency; increase partition count
Data loss                Offset commit strategy                     Ensure offsets.commit.interval.ms is configured
Build fails              JSON format error                          Validate the message body format
High query latency       Too many Segments                          Configure an auto-merge strategy