Big Data 245 - Offline Data Warehouse - Hive Zipper Table Getting Started: SCD Types, Table Creation and Loading

Slowly Changing Dimensions

Slowly Changing Dimensions (SCD) refer to dimension attributes that change slowly over time in the real world (slow is relative to fact tables, where data changes faster than dimension tables).

The problem of processing historical change information in dimension tables is called the SCD problem. Common methods for handling slowly changing dimensions include:

  • Keep original value
  • Direct overwrite
  • Add new attribute column
  • Snapshot table
  • Zipper table

Types of Slowly Changing Dimensions

SCD Type 0: No Processing of Changes

  • Characteristics: When attributes change, changes are not recorded, only the latest value is kept.
  • Applicable scenarios: Dimension attributes have no impact on historical analysis.
  • Advantages: Simple to implement.
  • Disadvantages: Cannot preserve historical information.

SCD Type 1: Overwrite Changes

  • Characteristics: When attributes change, overwrite the old value directly.
  • Applicable scenarios: Only need to keep the latest dimension information, historical data is not important.
  • Advantages: Small storage space, high query efficiency.
  • Disadvantages: Cannot trace historical information, loses data change records.
  • Example: When customer address changes, only update the address field.

SCD Type 2: Preserve Historical Records

  • Characteristics: Create a new record for each change, and can distinguish current data from historical data through identification fields or timestamps.

Implementation methods:

  • Add version number: Add a version number field to indicate record version.

  • Add effective time interval: Add start and end time fields to indicate the validity period of the record.

  • Applicable scenarios: Need to preserve all historical information, support time-based backtracking queries.

  • Advantages: Complete historical change records.

  • Disadvantages: Increased data volume, query complexity may increase.

  • Example: When customer address changes, keep the old address record, create a new record for the new address.

SCD Type 3: Limited Historical Records

  • Characteristics: Set additional fields for changed attributes, keeping only limited historical information (e.g., most recent change).
  • Applicable scenarios: Only need to save part of historical information, lower storage space requirements.
  • Advantages: Reduced data volume, low storage requirements.
  • Disadvantages: Limited historical records, cannot meet more complex backtracking analysis.
  • Example: Add “old address” and “current address” two fields.

SCD Type 4: History Table

  • Characteristics: Store historical records in a separate history table, main table only keeps current data.
  • Applicable scenarios: Need to completely preserve historical information while keeping the main table concise.
  • Advantages: Simple main table data, efficient query for current values.
  • Disadvantages: Requires additional history table, increased complexity when querying historical information.
  • Example: Main table stores customer’s latest address, history table stores address change records.

SCD Type 6: Hybrid Type

  • Characteristics: Combines Type 1, 2, and 3, keeping both the latest value and limited historical information, while also preserving complete historical records.
  • Applicable scenarios: Need to balance historical records and current value query efficiency.
  • Advantages: Combines advantages of multiple types.
  • Disadvantages: More complex implementation.
  • Example: Main table records latest information, while adding version number and timestamp to trace history.

Keep Original Value

Dimension attribute values are not changed, keeping the original value.

Direct Overwrite

Modify dimension attribute to the latest value, directly overwrite, without preserving historical information.

Add New Attribute Column

Add a new column to the dimension table, the original attribute column stores the previous version’s attribute value, the current attribute column stores the current version’s attribute value.

Snapshot Table

Retain a full snapshot of data every day. Simple and efficient. The downside is information duplication and wasted disk space. Applicable to dimension tables that are not too large.

Zipper Table

Zipper tables are applicable when: the table has large data volume, and the data has additions and changes, but most of it remains unchanged and changes slowly.

Applicable scenarios:

  • Large data volume in the table
  • Some fields in the table will be updated
  • The proportion of record changes in the table is not high
  • Need to preserve historical information

Dimension Table Zipper Table Application Case

Create Tables

CREATE DATABASE test;

-- User information (partitioned table)
DROP TABLE IF EXISTS test.userinfo;
CREATE TABLE test.userinfo(
  userid STRING COMMENT 'User ID',
  mobile STRING COMMENT 'Mobile number',
  regdate STRING COMMENT 'Registration date')
COMMENT 'User information'
PARTITIONED BY (dt string)
row format delimited fields terminated by ',';

-- Zipper table (stores user historical information)
-- Zipper table is not a partitioned table; has two additional fields: start_date, end_date
DROP TABLE IF EXISTS test.userhis;
CREATE TABLE test.userhis(
  userid STRING COMMENT 'User ID',
  mobile STRING COMMENT 'Mobile number',
  regdate STRING COMMENT 'Registration date',
  start_date STRING,
  end_date STRING)
COMMENT 'User information zipper table'
row format delimited fields terminated by ',';

Data File (userinfo.dat)

001,13551111111,2020-03-01,2020-06-20
002,13561111111,2020-04-01,2020-06-20
003,13571111111,2020-05-01,2020-06-20
004,13581111111,2020-06-01,2020-06-20
002,13562222222,2020-04-01,2020-06-21
004,13582222222,2020-06-01,2020-06-21
005,13552222222,2020-06-21,2020-06-21
004,13333333333,2020-06-01,2020-06-22
005,13533333333,2020-06-21,2020-06-22
006,13733333333,2020-06-22,2020-06-22
001,13554444444,2020-03-01,2020-06-23
003,13574444444,2020-05-01,2020-06-23
005,13555554444,2020-06-21,2020-06-23
007,18600744444,2020-06-23,2020-06-23
008,18600844444,2020-06-23,2020-06-23

Load Data

-- Create intermediate table (non-partitioned table)
DROP TABLE IF EXISTS test.tmp1;
CREATE TABLE test.tmp1 AS
SELECT * FROM test.userinfo;

-- Set field delimiter for tmp1 non-partitioned table to ','
ALTER TABLE test.tmp1 SET SERDEPROPERTIES('field.delim' = ',');

-- Load data into intermediate table
LOAD DATA LOCAL INPATH '/opt/wzk/userinfo.dat' INTO TABLE test.tmp1;

-- Enable nonstrict mode for dynamic partitioning
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Load data from intermediate table to partitioned table
INSERT INTO TABLE test.userinfo
PARTITION (dt)
SELECT * FROM test.tmp1;

Error Quick Reference

SymptomRoot CauseFix
Readers still cannot distinguish SCD from zipper table after readingConfusing “concept classification” with “engineering implementation”Clarify: zipper table is usually one implementation of SCD Type 2
SCD Type 0 contradicts “keep original value”Type 0 incorrectly written as “keep latest value”Change Type 0 to “never updates with source data changes, always keeps initial value”
Field misalignment or delimiter failure after LOAD DATAIntermediate table default SerDe inconsistent with source file delimiterSpecify delimiter directly when creating tmp1, avoid depending on CTAS then modifying SerDe
Dynamic partition insert errorNon-strict dynamic partition not enabled or column order doesn’t matchCheck and enable nonstrict for hive.exec.dynamic.partition.mode
Queried data is not “zipper table history”Only completed original dimension table loading, didn’t actually do historical chain closingAdd userhis initialization and daily incremental zipper SQL
Historical query results duplicate or distortedDidn’t explain current record end date sentinel valueClarify current record end_date=‘9999-12-31’ or business max date
Unclear snapshot table vs zipper table selectionNo cost/query complexity comparison providedAdd selection rules: small tables with high readability priority to snapshot table, large tables with low change rate and need to trace history priority to zipper table