TL;DR
- Scenario: Real-time data stream (Kafka) connected to Kylin, achieving minute-level OLAP analysis
- Conclusion: Streaming Cubing supports aggregation queries over near-real-time data, with data freshness of 3-5 minutes
- Output: Kafka config, message format, build command, scheduled execution
Streaming Cubing Overview
What is Streaming Cubing
Apache Kylin v1.6 introduced the Streaming Cubing feature to meet near-real-time data update requirements. It consumes data from a Kafka message queue and leverages the Hadoop ecosystem's processing capability to build a Cube that refreshes at minute-level latency (typically 3-5 minutes).
Core Features
- Consumes messages from a Kafka message queue
- Minute-level data updates
- Same SQL query syntax as offline Cubes
- Supports the Lambda architecture (real-time + batch)
Implementation Steps
1. Create Project
Create project in Kylin Web UI.
2. Define Data Source (Kafka)
Data Source → Add Streaming Table
3. Define Model
Model → Streaming Model
4. Define Cube
Cube → Streaming Cube
5. Build Cube
Build → Streaming Build
6. Job Scheduling
Use crontab to trigger incremental builds periodically.
Message Body JSON Structure
Requirements
```json
{
  "dimensions": {
    "region": "APAC",
    "product_line": "smartphone"
  },
  "metrics": {
    "order_count": 42,
    "revenue": 12500.00
  },
  "timestamp": "2023-05-15T09:15:30Z"
}
```
Three-part JSON
- dimensions: Dimension fields
- metrics: Measure fields
- timestamp: Event time, used to derive the time partition column
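Before producing messages, it can help to check that each body satisfies this three-part structure. A minimal sketch in plain Python (the `validate_message` helper and its rules are illustrative, not part of Kylin; field names are taken from the example above):

```python
import json

REQUIRED_PARTS = ("dimensions", "metrics", "timestamp")

def validate_message(raw):
    """Parse a message body and verify it has the three required parts."""
    msg = json.loads(raw)
    for part in REQUIRED_PARTS:
        if part not in msg:
            raise ValueError("missing required part: " + part)
    # dimensions and metrics must themselves be JSON objects
    if not isinstance(msg["dimensions"], dict) or not isinstance(msg["metrics"], dict):
        raise ValueError("dimensions and metrics must be JSON objects")
    return msg

sample = """{
  "dimensions": {"region": "APAC", "product_line": "smartphone"},
  "metrics": {"order_count": 42, "revenue": 12500.00},
  "timestamp": "2023-05-15T09:15:30Z"
}"""
msg = validate_message(sample)
```

A message failing the check raises `ValueError` before it ever reaches the topic, which is cheaper than debugging a failed build.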
Time Partition Column
Place the time partition column (such as minute_start) at the head of the RowKey order, so that time-range queries can prune data efficiently.
Build Commands
REST API Build
```shell
curl -X PUT --user ADMIN:KYLIN \
  -H "Content-Type:application/json;charset=utf-8" \
  -d '{
        "sourceOffsetStart": 0,
        "sourceOffsetEnd": 9223372036854775807,
        "buildType": "BUILD"
      }' \
  http://h122.wzk.icu:7070/kylin/api/cubes/streaming_cube1/build2
```
Parameter Description
- sourceOffsetStart: Kafka start offset
- sourceOffsetEnd: Kafka end offset; 9223372036854775807 (Long.MAX_VALUE) means consume up to the latest available message
- buildType: BUILD for an incremental build (MERGE and REFRESH are also supported)
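The same request can be assembled programmatically. A sketch using only the Python standard library (host, cube name, and credentials copied from the curl example above; `build_request` is a hypothetical helper, and actually sending the request via `urllib.request.urlopen` is left to the caller):

```python
import base64
import json
import urllib.request

KYLIN_HOST = "http://h122.wzk.icu:7070"   # from the example above
CUBE_NAME = "streaming_cube1"
LONG_MAX = 2**63 - 1                      # 9223372036854775807: read to latest

def build_request(offset_start=0, offset_end=LONG_MAX):
    """Assemble the PUT /kylin/api/cubes/{cube}/build2 request."""
    payload = json.dumps({
        "sourceOffsetStart": offset_start,
        "sourceOffsetEnd": offset_end,
        "buildType": "BUILD",
    }).encode("utf-8")
    # Kylin uses HTTP Basic auth; ADMIN:KYLIN are the defaults from the example.
    auth = base64.b64encode(b"ADMIN:KYLIN").decode("ascii")
    return urllib.request.Request(
        url=f"{KYLIN_HOST}/kylin/api/cubes/{CUBE_NAME}/build2",
        data=payload,
        method="PUT",
        headers={
            "Content-Type": "application/json;charset=utf-8",
            "Authorization": f"Basic {auth}",
        },
    )

req = build_request()
```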
Auto Build (crontab)
Trigger an incremental build every 20 minutes. Note that a crontab entry must fit on a single line (crontab does not support backslash line continuation), so the command is written as one line:

```shell
*/20 * * * * curl -X PUT --user ADMIN:KYLIN -H "Content-Type:application/json;charset=utf-8" -d '{"sourceOffsetStart": 0, "sourceOffsetEnd": 9223372036854775807, "buildType": "BUILD"}' http://h122.wzk.icu:7070/kylin/api/cubes/streaming_cube1/build2
```
Micro-batch Build & Refresh Merge Window
| Window Type | Time Interval |
|---|---|
| 0.5h | 30 minutes |
| 4h | 4 hours |
| 1d | 1 day |
| 7d | 7 days |
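One way to read the table: contiguous micro-batch Segments become a merge candidate once they fully cover one of the windows, and larger windows take precedence. A simplified sketch of that rule, with threshold values taken from the table (Kylin's actual auto-merge logic is internal and more involved; this only illustrates the escalation):

```python
# Merge window thresholds from the table above, in seconds, ascending.
THRESHOLDS = [
    ("0.5h", 30 * 60),
    ("4h", 4 * 3600),
    ("1d", 24 * 3600),
    ("7d", 7 * 24 * 3600),
]

def merge_candidate(segment_seconds):
    """Return the largest window fully covered by the given contiguous
    segment durations (in seconds), or None if even 0.5h is not covered."""
    total = sum(segment_seconds)
    hit = None
    for name, seconds in THRESHOLDS:
        if total >= seconds:
            hit = name
    return hit

# Six 5-minute micro-batches cover exactly the 0.5h window.
candidate = merge_candidate([300] * 6)  # -> "0.5h"
```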
Typical Application Scenarios
E-commerce Transaction Analysis
- Real-time sales statistics
- Order volume monitoring
- Regional sales ranking
User Behavior Analysis
- Real-time UV/PV
- Click stream analysis
- User profile real-time updates
IoT Device Monitoring
- Sensor data aggregation
- Anomaly alerts
- Device status real-time statistics
Parser Config
TimedJsonStreamParser
Use TimedJsonStreamParser to parse three-part JSON:
- Automatically extract dimensions
- Automatically extract metrics
- Automatically extract timestamp
Time Partition
- minute_start: Minute-level partition
- hour_start: Hour-level partition
- day_start: Day-level partition
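Each partition value is the message timestamp floored to the corresponding boundary. A sketch of that derivation in plain Python (the real derivation happens inside Kylin's parser; the timestamp format matches the message body example above):

```python
from datetime import datetime, timezone

def partition_columns(ts):
    """Floor an ISO-8601 UTC timestamp (as in the message body example)
    to minute / hour / day partition boundaries."""
    t = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return {
        "minute_start": t.replace(second=0),
        "hour_start": t.replace(minute=0, second=0),
        "day_start": t.replace(hour=0, minute=0, second=0),
    }

cols = partition_columns("2023-05-15T09:15:30Z")
```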
Error Quick Reference
| Symptom | Root Cause Location | Fix |
|---|---|---|
| Kafka consumption lag | Network/partitions/consumer capacity | Check consumer concurrency, increase partition count |
| Data loss | Offset commit strategy | Review the consumer's offset commit settings (e.g. enable.auto.commit, auto.commit.interval.ms) |
| Build fails | JSON format error | Validate message body format |
| Query latency high | Too many Segments | Configure auto merge strategy |