This article is migrated from Juejin. Original link: Big Data-247 Offline Data Warehouse - E-commerce Analysis Hive SCD Type 2
TL;DR
- Scenario: Offline data warehouse needs to save order history state at low cost while supporting daily rollback and change analysis
- Solution: Using ODS daily incremental table + DWD SCD Type 2, more balanced than “keep latest only” and “store full snapshot daily”
- Output: Hive SCD Type 2 initialization, incremental refresh SQL, close chain rules and common error quick reference
What is SCD Type 2
Slowly Changing Dimension Type 2 is a commonly used dimension table design approach in data warehouses, used to record historical changes of data.
Scenario Example
Assume there’s an order table:
- June 20: 3 records (001/002/003)
- June 21: 5 records, add 2 new (004/005), modify 1 (001)
- June 22: 6 records, add 1 new (006), modify 2 (003/005)
Common handling approaches:
| Approach | Problem |
|---|---|
| Keep latest only | Cannot view historical state |
| Store full snapshot daily | Storage waste, data duplication |
SCD Type 2 balances this well:
-- SCD Type 2 structure
CREATE TABLE dwd_orders(
orderid INT,
createtime STRING,
modifiedtime STRING,
status STRING,
start_date STRING, -- Validity start
end_date STRING -- Validity end, '9999-12-31' means currently valid
);
SCD Type 2 Implementation
Prerequisites
- ODS layer has order table, data partitioned by day, stores daily incremental data
- Order status includes: created, paid, completed
- Refresh frequency is once per day, get previous day’s incremental data
Full Initialization
-- Load initial data
LOAD DATA LOCAL INPATH '/data/order1.dat' INTO TABLE test.ods_orders PARTITION(dt='2020-06-20');
-- Initialize SCD Type 2
INSERT OVERWRITE TABLE test.dwd_orders
SELECT
orderid,
createtime,
modifiedtime,
status,
createtime AS start_date,
'9999-12-31' AS end_date
FROM test.ods_orders
WHERE dt='2020-06-20';
Incremental Refresh
-- 1. Handle new data
INSERT OVERWRITE TABLE test.dwd_orders
SELECT
orderid,
createtime,
modifiedtime,
status,
modifiedtime AS start_date,
'9999-12-31' AS end_date
FROM test.ods_orders
WHERE dt = '2020-06-21'
UNION ALL
-- 2. Handle historical data (close chain)
SELECT
A.orderid,
A.createtime,
A.modifiedtime,
A.status,
A.start_date,
CASE
WHEN B.orderid IS NOT NULL AND A.end_date > '2020-06-21'
THEN '2020-06-20' -- Close chain date
ELSE A.end_date
END AS end_date
FROM test.dwd_orders A
LEFT JOIN (
SELECT * FROM test.ods_orders WHERE dt = '2020-06-21'
) B ON A.orderid = B.orderid;
Query Snapshot of Specific Day
-- Query order status on 2020-06-21
SELECT *
FROM dwd_orders
WHERE start_date <= '2020-06-21'
AND end_date >= '2020-06-21';
Common Error Quick Reference
| Symptom | Root Cause | Fix |
|---|---|---|
| Multiple “9999-12-31” not closed for same order | During incremental refresh only inserts new data, old chain not properly closed | Check record count with end_date='9999-12-31' |
| Historical intervals overlap | Close chain date calculation error | Check date intervals sorted by orderid |
| Query snapshot for specific day incorrect | Filter condition error | Check if filtering by date interval |
| New orders normal, modified orders lose history | Direct overwrite, didn’t keep old version | Compare record count before and after refresh |
| Data doubles after daily batch | UNION ALL boundary unclear | Check if incremental day executed repeatedly |
Summary
SCD Type 2 is an effective solution for handling historical state changes in offline data warehouse:
- Advantages: Balances historical tracing and storage cost
- Key: Correctly handle incremental close chain logic
- Note: Ensure idempotency control, avoid repeated execution causing data duplication
This article was first published on Juejin