Big Data 248 - Offline Data Warehouse: Dimension Tables

Basic Introduction

First, determine which tables are fact tables and which are dimension tables (the colors refer to the source diagram):

  • Green indicates fact tables
  • Gray indicates dimension tables

How should each dimension table be processed: as a daily snapshot or as a zipper table?

  • Small tables use daily snapshot tables: the product category, merchant store, merchant region organization, and payment method tables
  • Large tables use zipper tables: the product information table
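The storage trade-off behind this rule can be sketched with rough row-count arithmetic (a simplified model; the table size and change rate below are illustrative assumptions, not measurements):

```python
# Rough row-count model: a daily snapshot stores a full copy of the
# table every day, while a zipper table stores one row per version
# of each record (initial rows plus changed rows).

def snapshot_rows(table_size: int, days: int) -> int:
    # Full copy of the table per day.
    return table_size * days

def zipper_rows(table_size: int, days: int, daily_change_rate: float) -> int:
    # One row per record, plus one new row per daily change.
    return table_size + int(table_size * daily_change_rate) * days

# Hypothetical product table: 10 million rows, 1% changing per day.
print(snapshot_rows(10_000_000, 365))      # 3,650,000,000 rows over a year
print(zipper_rows(10_000_000, 365, 0.01))  # 46,500,000 rows over a year
```

For small tables the snapshot's redundancy is cheap and queries stay trivial; for large tables the zipper form is orders of magnitude smaller.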

Product Category Table

Normalization vs Denormalization

Database normalization is a set of guidelines for designing relational database structures, aimed at reducing data redundancy, ensuring reasonable data dependencies, and improving data consistency. However, there are also potential disadvantages to following normalization:

  • Performance Issues: Highly normalized databases may cause slower queries and join operations, as complex joins across multiple tables are needed to get complete information
  • Increased Complexity: As normalization goes deeper, the database schema becomes more complex and harder to maintain
  • Over-engineering: Sometimes over-pursuing normalization leads to over-engineering for simple scenarios
  • Inefficient Read Operations: Normalization optimizes for data integrity during writes, which can make frequent read operations inefficient

Solutions:

  • Choose appropriate normalization levels
  • Denormalization
  • Use caching mechanisms
  • Partitioning and sharding
  • Index optimization
  • Evaluate business requirements

Create Table

DROP TABLE IF EXISTS dim.dim_trade_product_cat;
create table if not exists dim.dim_trade_product_cat(
firstId int, -- Level 1 product category ID
firstName string, -- Level 1 product category name
secondId int, -- Level 2 product category ID
secondName string, -- Level 2 product category name
thirdId int, -- Level 3 product category ID
thirdName string -- Level 3 product category name
)
partitioned by (dt string)
STORED AS PARQUET;

Data Loading Script

vim dim_load_product_cat.sh
#!/bin/bash
source /etc/profile
if [ -n "$1" ]
then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
sql="
insert overwrite table dim.dim_trade_product_cat
partition(dt='$do_date')
select
t1.catid, -- Level 1 category ID
t1.catname, -- Level 1 category name
t2.catid, -- Level 2 category ID
t2.catname, -- Level 2 category name
t3.catid, -- Level 3 category ID
t3.catname -- Level 3 category name
from
-- Product level 3 category data
(select catid, catname, parentid
from ods.ods_trade_product_category
where level=3 and dt='$do_date') t3
left join
-- Product level 2 category data
(select catid, catname, parentid
from ods.ods_trade_product_category
where level=2 and dt='$do_date') t2
on t3.parentid = t2.catid
left join
-- Product level 1 category data
(select catid, catname, parentid
from ods.ods_trade_product_category
where level=1 and dt='$do_date') t1
on t2.parentid = t1.catid;
"
hive -e "$sql"
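The three-way self-join above can be mirrored in plain Python to see what the flattening does: each level-3 category walks up `parentid` twice to pick up its level-2 and level-1 ancestors (column order follows the SQL; the sample rows are made up):

```python
# Each row of ods_trade_product_category: (catid, catname, parentid, level)
rows = [
    (1, "Electronics",    0, 1),
    (2, "Phones",         1, 2),
    (3, "Smartphones",    2, 3),
    (4, "Feature phones", 2, 3),
]

by_id = {r[0]: r for r in rows}

def flatten(level3_row):
    """Walk level-3 -> level-2 -> level-1 via parentid, like the left joins."""
    catid, catname, parentid, _ = level3_row
    l2 = by_id.get(parentid)                 # may be None, like a LEFT JOIN miss
    l1 = by_id.get(l2[2]) if l2 else None
    return (
        l1[0] if l1 else None, l1[1] if l1 else None,  # firstId, firstName
        l2[0] if l2 else None, l2[1] if l2 else None,  # secondId, secondName
        catid, catname,                                # thirdId, thirdName
    )

wide = [flatten(r) for r in rows if r[3] == 3]
print(wide[0])  # (1, 'Electronics', 2, 'Phones', 3, 'Smartphones')
```

Driving the join from the level-3 side matches the SQL: every leaf category produces exactly one wide row, even if an ancestor is missing.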

Product Region Organization Table

Create Table

The merchant store table and the merchant region organization table are combined into a single dimension table. This is another denormalized design: the two tables are joined and widened into one.

drop table if exists dim.dim_trade_shops_org;
create table dim.dim_trade_shops_org(
shopid int,
shopName string,
cityId int,
cityName string,
regionId int,
regionName string
)
partitioned by (dt string)
STORED AS PARQUET;

Data Loading Script

vim dim_load_shop_org.sh
#!/bin/bash
source /etc/profile
if [ -n "$1" ]
then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
sql="
insert overwrite table dim.dim_trade_shops_org
partition(dt='$do_date')
select t1.shopid,
t1.shopname,
t2.id as cityId,
t2.orgname as cityName,
t3.id as regionId,
t3.orgname as regionName
from (select shopId, shopName, areaId
from ods.ods_trade_shops
where dt='$do_date') t1
left join
(select id, parentId, orgname, orglevel
from ods.ods_trade_shop_admin_org
where orglevel=2 and dt='$do_date') t2
on t1.areaid = t2.id
left join
(select id, parentId, orgname, orglevel
from ods.ods_trade_shop_admin_org
where orglevel=1 and dt='$do_date') t3
on t2.parentid = t3.id;
"
hive -e "$sql"
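The left joins matter here: a shop whose areaId has no matching org row still survives with NULL city/region columns rather than being dropped. A quick Python sketch of that behavior (the sample shops and orgs are made up):

```python
# (shopid, shopname, areaid)
shops = [(100, "Shop A", 11), (101, "Shop B", 99)]
# id -> (id, parentId, orgname, orglevel)
orgs = {11: (11, 1, "City X", 2), 1: (1, 0, "Region N", 1)}

def widen(shop):
    """LEFT JOIN shop -> city org (level 2) -> region org (level 1)."""
    shopid, shopname, areaid = shop
    city = orgs.get(areaid)                      # LEFT JOIN miss -> None
    region = orgs.get(city[1]) if city else None
    return (shopid, shopname,
            city[0] if city else None, city[2] if city else None,
            region[0] if region else None, region[2] if region else None)

print(widen(shops[0]))  # (100, 'Shop A', 11, 'City X', 1, 'Region N')
print(widen(shops[1]))  # (101, 'Shop B', None, None, None, None)
```

An inner join would silently lose Shop B; the left join keeps every shop, which is usually what a dimension table wants.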

Product Information Table

Data Processing

Use zipper table for product information processing.

Historical Data

Historical data => initialize the zipper table (start date: taken from modifyTime, or createTime when modifyTime is null; end date: 9999-12-31). Execute only once.

Daily Data

  • New data: the day's incremental data (ODS) => start date: current day, end date: 9999-12-31
  • Existing data: zipper table (DIM) left join the day's incremental data (ODS). If the join matches (the product appears in today's increment, i.e. it changed), close the open record by setting its end date to the current day; if there is no match, the end date stays unchanged
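The two bullets above amount to a small merge routine. A minimal Python sketch of the same logic, with dicts standing in for Hive rows and OPEN_END mirroring the 9999-12-31 sentinel (for brevity, start_dt here uses the load date, whereas the script derives it from modifyTime):

```python
OPEN_END = "9999-12-31"

def merge_day(dim_rows, ods_rows, do_date):
    """One daily zipper update: close matched open rows, append new versions."""
    ods_ids = {r["productId"] for r in ods_rows}
    out = []
    # Existing zipper rows: close open rows whose product appears today.
    for r in dim_rows:
        r = dict(r)
        if r["end_dt"] == OPEN_END and r["productId"] in ods_ids:
            r["end_dt"] = do_date
        out.append(r)
    # Today's incremental rows become new open versions.
    for r in ods_rows:
        out.append({"productId": r["productId"],
                    "start_dt": do_date, "end_dt": OPEN_END})
    return out

dim = [{"productId": 1, "start_dt": "2020-07-12", "end_dt": OPEN_END}]
ods = [{"productId": 1}, {"productId": 2}]
result = merge_day(dim, ods, "2020-07-13")
# Product 1's old version is closed on 2020-07-13; products 1 and 2
# each gain a new open version starting 2020-07-13.
```

This is exactly the shape of the UNION ALL in the incremental script below: the first branch produces the new open rows, the second rewrites the history with closed end dates.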

Create Dimension Table

A zipper table adds two columns to record each record's effective date and expiration date:

drop table if exists dim.dim_trade_product_info;
create table dim.dim_trade_product_info(
  `productId` bigint,
  `productName` string,
  `shopId` string,
  `price` decimal,
  `isSale` tinyint,
  `status` tinyint,
  `categoryId` string,
  `createTime` string,
  `modifyTime` string,
  `start_dt` string,
  `end_dt` string
) COMMENT 'Product Table'
STORED AS PARQUET;
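With start_dt/end_dt in place, a point-in-time lookup selects the rows whose validity interval covers the target date. A minimal sketch of that filter, assuming the convention that a closed row's end_dt is the last day it was valid (the sample history is hypothetical):

```python
def rows_as_of(rows, d):
    # A row is valid on date d if start_dt <= d <= end_dt
    # (plain string comparison works because dates are yyyy-MM-dd).
    return [r for r in rows if r["start_dt"] <= d <= r["end_dt"]]

history = [
    {"productId": 1, "price": 100, "start_dt": "2020-07-12", "end_dt": "2020-07-14"},
    {"productId": 1, "price": 120, "start_dt": "2020-07-15", "end_dt": "9999-12-31"},
]
print(rows_as_of(history, "2020-07-13"))  # the 100-priced version
print(rows_as_of(history, "2020-08-01"))  # the current 120-priced version
```

The same predicate as a WHERE clause (`start_dt <= d and end_dt >= d`) is how the zipper table answers "what did this product look like on date d" without storing a daily snapshot.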

Initial Data Load

Load the historical data; this needs to run only once:

insert overwrite table dim.dim_trade_product_info
select productId,
productName,
shopId,
price,
isSale,
status,
categoryId,
createTime,
modifyTime,
-- Use modifyTime's date if modifyTime is not null, otherwise createTime's; substr extracts yyyy-MM-dd
case when modifyTime is not null
then substr(modifyTime, 1, 10)
else substr(createTime, 1, 10)
end as start_dt,
'9999-12-31' as end_dt
from ods.ods_trade_product_info
where dt = '2020-07-12';
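The CASE expression above reduces to a simple fallback rule; the same logic in Python (column names follow the table, NULL becomes None):

```python
def start_dt(create_time, modify_time):
    # Prefer modifyTime's date; fall back to createTime when modifyTime is NULL.
    return (modify_time or create_time)[:10]

print(start_dt("2020-07-10 08:30:00", "2020-07-12 09:15:00"))  # 2020-07-12
print(start_dt("2020-07-10 08:30:00", None))                   # 2020-07-10
```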

Incremental Data Import Script

This script runs repeatedly, once per daily load:

vim dim_load_product_info.sh
#!/bin/bash
source /etc/profile
if [ -n "$1" ]
then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
sql="
insert overwrite table dim.dim_trade_product_info
select productId,
productName,
shopId,
price,
isSale,
status,
categoryId,
createTime,
modifyTime,
case when modifyTime is not null
then substr(modifyTime, 1, 10)
else substr(createTime, 1, 10)
end as start_dt,
'9999-12-31' as end_dt
from ods.ods_trade_product_info
where dt='$do_date'
union all
select dim.productId,
dim.productName,
dim.shopId,
dim.price,
dim.isSale,
dim.status,
dim.categoryId,
dim.createTime,
dim.modifyTime,
dim.start_dt,
case when dim.end_dt >= '9999-12-31' and ods.productId is not null
then '$do_date'
else dim.end_dt
end as end_dt
from dim.dim_trade_product_info dim left join
(select *
from ods.ods_trade_product_info
where dt='$do_date' ) ods
on dim.productId = ods.productId
"
hive -e "$sql"