TL;DR

  • Scenario: E-commerce sales data multi-dimensional analysis, Cube modeling and pre-computation by date/region/product/channel
  • Conclusion: Star modeling + Aggregation Group pruning can significantly reduce Cuboid count, queries achieve second-level response
  • Output: Complete modeling flow, dimension design principles, Cuboid optimization strategies, error quick reference table

Version Matrix

| Module | Technical Points | Engineering Suggestions |
| --- | --- | --- |
| Star Modeling | Fact table + dimension tables; single dimension table ≤ 300 MB | Keep dimension-table primary keys unique; avoid Hive views |
| Dimension Design | Date/region/product/channel | High-frequency dimensions first; handle hierarchy dimensions separately |
| Measure Design | SUM/COUNT/AVG/MAX/MIN | COUNT DISTINCT needs cardinality evaluation |
| Aggregation Group | Dimension grouping and combination control | Split into multiple groups when dimensions > 15 |
| Mandatory Dimension | Must appear in every WHERE/GROUP BY | Reduces invalid Cuboids |
| Hierarchy Dimension | Country → Province → City | Only pre-compute hierarchy-consistent combinations |
| Joint Dimension | Low cardinality, often queried together | Joint dimensions either all appear in a Cuboid or none do |
| Build Engine | Hive + Spark + HBase | Spark build replaces MapReduce |
| Query Acceleration | Cuboid matching | Ensure queries hit pre-computed results |
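The pruning advice above exists because cuboid count grows exponentially: a full cube over n dimensions materializes one cuboid per dimension subset, i.e. 2^n of them. A quick sketch:

```python
# Cuboid explosion: a full cube materializes one cuboid per dimension subset.
def full_cuboid_count(n_dims: int) -> int:
    return 2 ** n_dims

for n in (4, 8, 15, 20):
    print(n, full_cuboid_count(n))
# 15 dimensions already mean 32768 cuboids, which is why the table above
# advises splitting aggregation groups once dimensions exceed 15.
```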

Modeling Flow

1. Data Preparation

-- Fact table: E-commerce sales records
-- Note: in Hive the partition column (dt) is declared only in PARTITIONED BY,
-- never repeated in the column list
CREATE TABLE fact_sales (
  order_id STRING,
  region_id STRING COMMENT 'region foreign key',
  product_id STRING COMMENT 'product foreign key',
  channel_id STRING COMMENT 'channel foreign key',
  amount DECIMAL(18,2) COMMENT 'sales amount',
  quantity INT COMMENT 'sales quantity'
) PARTITIONED BY (dt DATE COMMENT 'partition field')
STORED AS PARQUET;
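Loading one day's partition might look like the following sketch; `ods_orders` is a hypothetical staging table, and the partition value comes from the PARTITION clause rather than the SELECT list:

```sql
-- Load one partition from a hypothetical staging table ods_orders
INSERT OVERWRITE TABLE fact_sales PARTITION (dt = '2024-01-15')
SELECT order_id, region_id, product_id, channel_id, amount, quantity
FROM ods_orders
WHERE order_date = '2024-01-15';
```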

-- Dimension table: Region
-- Hive has no enforced PRIMARY KEY; declare an informational constraint
-- (Hive 2.1+) and guarantee uniqueness upstream
CREATE TABLE dim_region (
  region_id STRING,
  region_name STRING,
  province STRING,
  city STRING,
  PRIMARY KEY (region_id) DISABLE NOVALIDATE
) STORED AS ORC;

-- Dimension table: Product
CREATE TABLE dim_product (
  product_id STRING,
  product_name STRING,
  category STRING,
  brand STRING,
  PRIMARY KEY (product_id) DISABLE NOVALIDATE
) STORED AS ORC;

-- Dimension table: Channel
CREATE TABLE dim_channel (
  channel_id STRING,
  channel_name STRING,
  channel_type STRING,
  PRIMARY KEY (channel_id) DISABLE NOVALIDATE
) STORED AS ORC;
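Because builds fail on duplicate dimension keys, a quick uniqueness check before each build is cheap insurance; any rows returned must be deduplicated first:

```sql
-- Dimension primary keys must be unique before a Cube build;
-- rows returned here indicate duplicates to clean up
SELECT region_id, COUNT(*) AS dup_cnt
FROM dim_region
GROUP BY region_id
HAVING COUNT(*) > 1;
```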

2. Model Definition

Create Model in Kylin Web UI:

  • Fact table: fact_sales
  • Dimension tables: dim_region, dim_product, dim_channel
  • Partition field: dt (date)
  • Measures: SUM(amount), SUM(quantity), COUNT(*)

3. Cube Design

Dimension Config

| Dimension Type | Field(s) | Config |
| --- | --- | --- |
| Mandatory | dt | Mandatory dimension; included in every query |
| Hierarchy | dim_region.province, dim_region.city | Hierarchy dimension (province above city) |
| Joint | dim_channel.channel_name, dim_channel.channel_type | Joint dimension (a joint set needs at least two fields) |
| Normal | dim_product.category, dim_product.brand | Normal dimension |

Aggregation Group

Group 1: dt, region_id (high-frequency combination)
Group 2: dt, product_id (product analysis)
Group 3: dt, channel_id (channel analysis)
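The pruning rules translate into simple arithmetic: a mandatory dimension contributes a factor of 1, a hierarchy of h levels a factor of h + 1, a joint set a factor of 2, and each normal dimension a factor of 2. A hedged sketch applying these factors to this cube's dimension config:

```python
# Rough upper-bound cuboid count for one aggregation group under Kylin's
# pruning rules (mandatory=1, hierarchy=h+1, joint set=2, normal=2 each).
def pruned_cuboids(mandatory, hierarchy_levels, joint_groups, normal):
    count = 1                      # each mandatory dim contributes factor 1
    for h in hierarchy_levels:
        count *= h + 1             # pick a prefix of the hierarchy, or none
    count *= 2 ** joint_groups     # each joint set appears whole or not at all
    count *= 2 ** normal           # normal dims combine freely
    return count

# This cube: dt mandatory, province->city hierarchy (2 levels),
# one joint set (channel fields), category + brand normal.
full = 2 ** 7                       # 7 dimensions with no pruning
pruned = pruned_cuboids(1, [2], 1, 2)
print(full, pruned)                 # 128 24
```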


Cube Build

Full Build

# Build via Kylin Web UI or REST API
curl -X PUT --user ADMIN:KYLIN \
  -H "Content-Type:application/json;charset=utf-8" \
  -d '{"startTime":"","endTime":"","buildType":"BUILD"}' \
  http://h122.wzk.icu:7070/kylin/api/cubes/ecommerce_cube/build

Build Flow

  1. Extract Fact Table: read the flat fact data from Hive
  2. Build Dimension Dictionary: encode dimension values into dictionaries
  3. Build Cube: compute each Cuboid
  4. Convert Cube to HBase: generate HFiles and bulk-load them into HBase
  5. Checkpoint: persist Cube metadata and enable the new segment
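Build progress can also be polled from the jobs REST endpoint; the parameters below follow the Kylin 3.x REST API and should be checked against your version's documentation:

```shell
# List the most recent build jobs for this cube (Kylin 3.x REST API)
curl -s --user ADMIN:KYLIN \
  "http://h122.wzk.icu:7070/kylin/api/jobs?cubeName=ecommerce_cube&limit=5&offset=0&timeFilter=1"
```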

Query Examples

SQL Queries

-- Sales by date + region
SELECT dt, region_name, SUM(amount) AS total_amount
FROM fact_sales
JOIN dim_region ON fact_sales.region_id = dim_region.region_id
WHERE dt >= '2024-01-01' AND dt <= '2024-01-31'
GROUP BY dt, region_name;

-- Monthly summary by product category
SELECT SUBSTR(dt, 1, 7) AS month, category,
       SUM(amount) AS total, COUNT(*) AS orders
FROM fact_sales
JOIN dim_product ON fact_sales.product_id = dim_product.product_id
GROUP BY SUBSTR(dt, 1, 7), category
ORDER BY month, total DESC;

Query Hit Analysis

View via Kylin Query page:

  • Hit Cuboid: which pre-computed Cuboid served the query
  • Query Time: whether the query returned within the expected sub-second latency
  • Scan Rows: number of rows scanned in HBase
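The same SQL can be submitted through Kylin's query REST API; the project name "ecommerce" here is an assumption:

```shell
# Submit a SQL query via REST (project name is an assumption)
curl -X POST --user ADMIN:KYLIN \
  -H "Content-Type: application/json" \
  -d '{"sql":"SELECT dt, SUM(amount) FROM fact_sales GROUP BY dt","project":"ecommerce","limit":100}' \
  http://h122.wzk.icu:7070/kylin/api/query
```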

Real-time OLAP (Kafka)

Kylin 4.0 supports real-time stream ingestion from Kafka:

  1. Create Streaming Cube: Select Kafka as data source
  2. Configure message format: JSON/Avro
  3. Set time window: Incremental build interval (minute level)
  4. Query: Same syntax as offline Cube query
An example message payload:

{
  "dimensions": {"region": "APAC", "product": "phone"},
  "metrics": {"revenue": 12500, "orders": 42},
  "timestamp": "2024-01-15T10:30:00Z"
}
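A message like the one above can be produced from Python; the serialization part is runnable as-is, while the Kafka send is sketched only (kafka-python and the topic name `sales_events` are assumptions, not from the original setup):

```python
import json

# Build one streaming event matching the schema shown above
event = {
    "dimensions": {"region": "APAC", "product": "phone"},
    "metrics": {"revenue": 12500, "orders": 42},
    "timestamp": "2024-01-15T10:30:00Z",
}
payload = json.dumps(event).encode("utf-8")

# Sending requires a running broker; sketched only (assumptions:
# kafka-python installed, topic "sales_events" exists):
# from kafka import KafkaProducer
# producer = KafkaProducer(bootstrap_servers="localhost:9092")
# producer.send("sales_events", payload)

decoded = json.loads(payload)
print(decoded["metrics"]["revenue"])  # 12500
```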

Error Quick Reference

| Symptom | Root Cause | Fix |
| --- | --- | --- |
| Build fails: duplicate dimension-table primary key | Dimension table contains duplicate primary keys | Deduplicate during preprocessing; ensure primary keys are unique |
| Build extremely slow; too many Cuboids | Aggregation Groups not configured | Split aggregation groups; configure mandatory/hierarchy dimensions |
| Query misses the Cube and falls back to Hive | Queried dimension combination not pre-computed | Add the Cuboid or adjust the query |
| HBase storage expansion > 1000% | High-cardinality dimensions without pruning | Use dictionary encoding; optimize dimension combinations |
| High real-time build latency | Kafka consumption lag | Tune batch-processing parameters; increase consumer concurrency |
| Dimension table OOM | Dimension table too large (> 300 MB) | Split the dimension table; reduce column count |