This article is migrated from Juejin. Original link: Big Data-247 Offline Data Warehouse - E-commerce Analysis Hive SCD Type 2

TL;DR

  • Scenario: Offline data warehouse needs to save order history state at low cost while supporting daily rollback and change analysis
  • Solution: Using ODS daily incremental table + DWD SCD Type 2, more balanced than “keep latest only” and “store full snapshot daily”
  • Output: Hive SCD Type 2 initialization, incremental refresh SQL, close chain rules and common error quick reference

What is SCD Type 2

Slowly Changing Dimension Type 2 is a commonly used dimension table design approach in data warehouses, used to record historical changes of data.

Scenario Example

Assume there’s an order table:

  • June 20: 3 records (001/002/003)
  • June 21: 5 records, add 2 new (004/005), modify 1 (001)
  • June 22: 6 records, add 1 new (006), modify 2 (003/005)

Common handling approaches:

ApproachProblem
Keep latest onlyCannot view historical state
Store full snapshot dailyStorage waste, data duplication

SCD Type 2 balances this well:

-- SCD Type 2 structure
CREATE TABLE dwd_orders(
  orderid INT,
  createtime STRING,
  modifiedtime STRING,
  status STRING,
  start_date STRING,  -- Validity start
  end_date STRING    -- Validity end, '9999-12-31' means currently valid
);

SCD Type 2 Implementation

Prerequisites

  • ODS layer has order table, data partitioned by day, stores daily incremental data
  • Order status includes: created, paid, completed
  • Refresh frequency is once per day, get previous day’s incremental data

Full Initialization

-- Load initial data
LOAD DATA LOCAL INPATH '/data/order1.dat' INTO TABLE test.ods_orders PARTITION(dt='2020-06-20');

-- Initialize SCD Type 2
INSERT OVERWRITE TABLE test.dwd_orders
SELECT
  orderid,
  createtime,
  modifiedtime,
  status,
  createtime AS start_date,
  '9999-12-31' AS end_date
FROM test.ods_orders
WHERE dt='2020-06-20';

Incremental Refresh

-- 1. Handle new data
INSERT OVERWRITE TABLE test.dwd_orders
SELECT
  orderid,
  createtime,
  modifiedtime,
  status,
  modifiedtime AS start_date,
  '9999-12-31' AS end_date
FROM test.ods_orders
WHERE dt = '2020-06-21'

UNION ALL

-- 2. Handle historical data (close chain)
SELECT
  A.orderid,
  A.createtime,
  A.modifiedtime,
  A.status,
  A.start_date,
  CASE
    WHEN B.orderid IS NOT NULL AND A.end_date > '2020-06-21'
    THEN '2020-06-20'  -- Close chain date
    ELSE A.end_date
  END AS end_date
FROM test.dwd_orders A
LEFT JOIN (
  SELECT * FROM test.ods_orders WHERE dt = '2020-06-21'
) B ON A.orderid = B.orderid;

Query Snapshot of Specific Day

-- Query order status on 2020-06-21
SELECT *
FROM dwd_orders
WHERE start_date <= '2020-06-21'
  AND end_date >= '2020-06-21';

Common Error Quick Reference

SymptomRoot CauseFix
Multiple “9999-12-31” not closed for same orderDuring incremental refresh only inserts new data, old chain not properly closedCheck record count with end_date='9999-12-31'
Historical intervals overlapClose chain date calculation errorCheck date intervals sorted by orderid
Query snapshot for specific day incorrectFilter condition errorCheck if filtering by date interval
New orders normal, modified orders lose historyDirect overwrite, didn’t keep old versionCompare record count before and after refresh
Data doubles after daily batchUNION ALL boundary unclearCheck if incremental day executed repeatedly

Summary

SCD Type 2 is an effective solution for handling historical state changes in offline data warehouse:

  • Advantages: Balances historical tracing and storage cost
  • Key: Correctly handle incremental close chain logic
  • Note: Ensure idempotency control, avoid repeated execution causing data duplication

This article was first published on Juejin