Cube Introduction
Apache Kylin is an open-source distributed analytics engine focused on providing OLAP (Online Analytical Processing) capability on big data with sub-second query latency. The Cube is one of Apache Kylin's core concepts: it pre-computes multi-dimensional aggregates over large-scale data to accelerate complex SQL queries.
Cube Basic Concepts
A Cube in Kylin is a pre-computed data structure produced by multi-dimensional modeling over a fact table (usually a business data table) and its dimension tables. A Cube materializes combinations of dimensions and measures, so pre-computed results can be fetched at query time, significantly reducing computation overhead.
- Dimension: Fields used to group, filter, and slice data, such as time, region, or product.
- Measure: Fields that are aggregated, such as sales amount or order count.
- Cuboid: The pre-aggregated result for one specific combination of dimensions. Every dimension combination in a Cube generates a Cuboid, so a Cube consists of multiple Cuboids, each storing the pre-aggregated results for its combination.
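The dimension-to-Cuboid relationship can be sketched in a few lines: a Cube with n dimensions has up to 2^n Cuboids, one per subset of the dimensions. The dimension names below are taken from the example model later in this article:

```python
from itertools import combinations

def enumerate_cuboids(dimensions):
    """Return every subset of the dimensions; each subset corresponds to one Cuboid."""
    cuboids = []
    for r in range(len(dimensions) + 1):
        for combo in combinations(dimensions, r):
            cuboids.append(combo)
    return cuboids

dims = ['date1', 'channelId', 'productId', 'regionId']
cuboids = enumerate_cuboids(dims)
print(len(cuboids))  # 2^4 = 16 Cuboids, including the empty (grand total) one
```

This exponential growth in Cuboids is exactly why the pruning techniques described below matter.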
Cube Creation Process
- Data Modeling: First create a Data Model in Kylin. The model defines the relationships between the fact table and its dimension tables (a star or snowflake schema), as well as the measure fields to aggregate.
- Cube Design: Design a Cube based on the Data Model, specifying its dimensions and measures. Kylin then derives all candidate dimension combinations (Cuboids) from this definition.
- Build Cube: The build process reads the underlying data sources (such as Hive or Kafka), then generates pre-computed data for each Cuboid according to the specified dimensions and measures. The pre-computed results are stored in HBase or another storage engine.
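Conceptually, building one Cuboid means grouping the fact rows by that Cuboid's dimension combination and storing the aggregates. A minimal sketch with SUM(amount) as the only measure; the rows and field positions are illustrative, not Kylin's actual build code:

```python
from collections import defaultdict

# Illustrative fact rows: (channelId, productId, amount)
fact_rows = [
    ('C001', 'P001', 2),
    ('C001', 'P002', 5),
    ('C002', 'P001', 3),
]

def build_cuboid(rows, dims):
    """Pre-aggregate SUM(amount) for one dimension combination (one Cuboid)."""
    index = {'channelId': 0, 'productId': 1}
    agg = defaultdict(int)
    for row in rows:
        key = tuple(row[index[d]] for d in dims)
        agg[key] += row[2]  # amount
    return dict(agg)

print(build_cuboid(fact_rows, ['channelId']))  # {('C001',): 7, ('C002',): 3}
```

A real build runs this grouping for every selected Cuboid, which is why build cost grows with the number of dimension combinations.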
Cube Query and Optimization
- Query Acceleration: When a SQL query arrives, Kylin selects the appropriate Cuboid based on the dimension combination involved in the query and returns its pre-computed result, avoiding real-time computation and greatly improving query performance.
- Cube Optimization: To control Cube size and speed up builds, Kylin supports Cube pruning: an "Aggregation Group" configures which Cuboids to generate, so only part of the dimension combinations are materialized and redundant computation is reduced.
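The effect of pruning can be approximated with a toy model: marking a dimension as mandatory keeps only the Cuboids that contain it, halving the Cuboid count per mandatory dimension. This sketch illustrates the idea only; it is not Kylin's Aggregation Group API:

```python
from itertools import combinations

def all_cuboids(dims):
    """Every dimension subset, as a list of sets."""
    return [set(c) for r in range(len(dims) + 1) for c in combinations(dims, r)]

def prune(cuboids, mandatory):
    """Keep only the Cuboids that contain every mandatory dimension."""
    return [c for c in cuboids if mandatory <= c]

dims = ['date1', 'channelId', 'productId', 'regionId']
full = all_cuboids(dims)          # 16 Cuboids for 4 dimensions
pruned = prune(full, {'date1'})   # date1 mandatory -> only 8 remain
print(len(full), len(pruned))     # 16 8
```

Hierarchy and joint dimensions in real Aggregation Groups prune further along the same lines: they declare combinations that can never be queried alone, so their Cuboids are never built.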
Real-time OLAP
Kylin 3.0 introduced real-time OLAP support, using Kafka as the input for real-time data streams to build real-time Cubes. With a Lambda architecture, Kylin can analyze real-time and batch data together.
Typical Cube Application Scenarios
- Large-scale Data Analysis: Cubes are well suited to analyzing ultra-large datasets, accelerating queries through pre-computation.
- Real-time Analysis: Real-time Cubes let users analyze streaming data in near real time.
- Business Intelligence (BI) Tool Integration: Kylin integrates with common BI tools such as Tableau and Power BI, so users can run complex multi-dimensional analysis with familiar SQL.
Prerequisites
You need a configured and running Kylin instance.
Prepare Data
Data Files
- dw_sales_data.txt
- dim_channel_data.txt
- dim_product_data.txt
- dim_region_data.txt
Data Generation Script (Python)
dw_sales_data:
```python
import random
import datetime

# Set parameters
num_records = 1000
output_file = 'dw_sales_data.txt'

# Define possible values
channel_ids = ['C001', 'C002', 'C003', 'C004']
product_ids = ['P001', 'P002', 'P003', 'P004']
region_ids = ['R001', 'R002', 'R003', 'R004']
base_date = datetime.date(2024, 1, 1)

# Generate data
with open(output_file, 'w') as f:
    for i in range(num_records):
        record_id = f"{i+1:04d}"
        date1 = (base_date + datetime.timedelta(days=random.randint(0, 365))).strftime('%Y-%m-%d')
        channel_id = random.choice(channel_ids)
        product_id = random.choice(product_ids)
        region_id = random.choice(region_ids)
        amount = random.randint(1, 100)
        price = round(random.uniform(10.0, 500.0), 2)
        line = f"{record_id},{date1},{channel_id},{product_id},{region_id},{amount},{price}\n"
        f.write(line)

print(f"{num_records} records have been written to {output_file}")
```
dim_channel_data:
```python
output_file = 'dim_channel_data.txt'

channels = [
    ('C001', 'Online Sales'),
    ('C002', 'Retail Store'),
    ('C003', 'Wholesale'),
    ('C004', 'Direct Sales')
]

with open(output_file, 'w') as f:
    for channel_id, channel_name in channels:
        line = f"{channel_id},{channel_name}\n"
        f.write(line)

print(f"Channel data has been written to {output_file}")
```
dim_product_data:
```python
output_file = 'dim_product_data.txt'

products = [
    ('P001', 'Smartphone'),
    ('P002', 'Laptop'),
    ('P003', 'Tablet'),
    ('P004', 'Smartwatch'),
    ('P005', 'Camera'),
    ('P006', 'Headphones'),
    ('P007', 'Monitor'),
    ('P008', 'Keyboard'),
    ('P009', 'Mouse'),
    ('P010', 'Printer')
]

with open(output_file, 'w') as f:
    for product_id, product_name in products:
        line = f"{product_id},{product_name}\n"
        f.write(line)

print(f"Product data has been written to {output_file}")
```
dim_region_data:
```python
output_file = 'dim_region_data.txt'

regions = [
    ('R001', 'North America'),
    ('R002', 'Europe'),
    ('R003', 'Asia'),
    ('R004', 'South America'),
    ('R005', 'Africa'),
    ('R006', 'Australia'),
    ('R007', 'Antarctica')
]

with open(output_file, 'w') as f:
    for region_id, region_name in regions:
        line = f"{region_id},{region_name}\n"
        f.write(line)

print(f"Region data has been written to {output_file}")
```
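Before loading the files into Hive, it is worth verifying that each generated row matches the table schema used in the next section (7 comma-separated fields, with `amount` an int and `price` a double). The sample line below is made up for illustration:

```python
def parse_sales_line(line):
    """Split one dw_sales row and coerce the numeric fields, mirroring the Hive schema."""
    fields = line.strip().split(',')
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}")
    record_id, date1, channel_id, product_id, region_id, amount, price = fields
    return {
        'id': record_id,
        'date1': date1,
        'channelId': channel_id,
        'productId': product_id,
        'regionId': region_id,
        'amount': int(amount),   # Hive column: amount int
        'price': float(price),   # Hive column: price double
    }

row = parse_sales_line('0001,2024-03-15,C001,P002,R003,42,199.99\n')
print(row['amount'], row['price'])  # 42 199.99
```

A mismatch here (extra commas, swapped columns) would load silently as NULLs in Hive, so catching it early saves a rebuild.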
Hive Create Table & Data Load SQL
```sql
-- Create the order database and table structures
create database if not exists `wzk_kylin`;

-- 1. Sales fact table: dw_sales
create table wzk_kylin.dw_sales(
  id string,
  date1 string,
  channelId string,
  productId string,
  regionId string,
  amount int,
  price double
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- 2. Channel dimension table: dim_channel
create table wzk_kylin.dim_channel(
  channelId string,
  channelName string
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- 3. Product dimension table: dim_product
create table wzk_kylin.dim_product(
  productId string,
  productName string
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- 4. Region dimension table: dim_region
create table wzk_kylin.dim_region(
  regionId string,
  regionName string
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Load data
LOAD DATA LOCAL INPATH '/opt/wzk/kylin_test/dw_sales_data.txt'
OVERWRITE INTO TABLE wzk_kylin.dw_sales;
LOAD DATA LOCAL INPATH '/opt/wzk/kylin_test/dim_channel_data.txt'
OVERWRITE INTO TABLE wzk_kylin.dim_channel;
LOAD DATA LOCAL INPATH '/opt/wzk/kylin_test/dim_product_data.txt'
OVERWRITE INTO TABLE wzk_kylin.dim_product;
LOAD DATA LOCAL INPATH '/opt/wzk/kylin_test/dim_region_data.txt'
OVERWRITE INTO TABLE wzk_kylin.dim_region;
```
Execute Data Load
```shell
cd /opt/wzk/kylin_test
hive -f kylin_examples.sql
```
Test Data
```shell
hive
```

```sql
use wzk_kylin;
select date1, sum(price) as total_money, sum(amount) as total_amount
from dw_sales
group by date1;
```
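The Hive query above can be cross-checked in plain Python on a handful of rows; the sketch below applies the same GROUP BY logic. The sample rows are illustrative, not taken from the generated file:

```python
from collections import defaultdict

# Illustrative rows: (date1, amount, price)
rows = [
    ('2024-01-01', 2, 100.0),
    ('2024-01-01', 3, 50.5),
    ('2024-01-02', 1, 200.0),
]

# date1 -> [total_money, total_amount], matching SUM(price) and SUM(amount)
totals = defaultdict(lambda: [0.0, 0])
for date1, amount, price in rows:
    totals[date1][0] += price
    totals[date1][1] += amount

for date1 in sorted(totals):
    total_money, total_amount = totals[date1]
    print(date1, total_money, total_amount)
```

Once a Cube covering `date1` with these two measures is built, Kylin answers this query from the matching Cuboid instead of scanning dw_sales.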
Error Quick Reference
| Symptom | Likely Cause | Fix |
|---|---|---|
| `Table not found wzk_kylin.dw_sales` | Database/table not created, or case mismatch | Re-run the create-table script; confirm the database prefix and case are consistent |
| `LOAD DATA LOCAL` fails or table is empty | Wrong local path, insufficient permissions, or separator mismatch | Fix the path and permissions; confirm the field separator and column order match the table definition |
| Kylin build job stuck or failed | HDFS/YARN queue or permission problems, or MapReduce parameters | Increase queue resources; tune build parallelism and memory in Kylin |
| Queries are slow (pre-computation not used) | Poor dimension/measure or Aggregation Group design | Simplify dimension combinations and prune Cuboids sensibly; create a separate Aggregation Group for high-frequency queries |
| "No realization found" for a query | The Cube does not cover that dimension combination or time partition | Expand the dimension combinations, or build the missing Segment for that partition |
| High real-time ingestion latency | Kafka topic/schema does not match the model | Align the schema; increase batch/cache sizes; verify timestamps and watermarks |
| `RegionTooBusy`/timeout (HBase) | Region hot spot or memory parameters too small | Pre-split regions and separate hot/cold data; increase MemStore/BlockCache |
| Permission errors (Kerberos/ACL) | Security authentication is enabled on the cluster | Check the keytab and principal; acquire tickets and configure the Kylin/Hive clients |