Big Data 247 - Offline Data Warehouse - Hive Order Zipper Table: Incremental Refresh Implementation

TL;DR

  • Scenario: In offline data warehouse, need to preserve order historical states at low cost while supporting daily backtracking and change analysis
  • Conclusion: Using ODS daily incremental table + DWD zipper table is more balanced than “keep only latest” and “store full snapshot daily”
  • Output: Provides Hive zipper table initialization, incremental refresh SQL, chain closing rules, and common error quick reference

Periodic Fact Table Definition

Periodic fact tables record regularly occurring business events or metric data, such as daily, monthly, or quarterly data. Their design purpose is to help users quickly query and analyze trends and changes in these periodic data.

Zipper Table Implementation

ODS Layer Order Table

DROP TABLE test.ods_orders;
CREATE TABLE test.ods_orders(
  orderid INT,
  createtime STRING,
  modifiedtime STRING,
  status STRING
) PARTITIONED BY (dt STRING)
row format delimited fields terminated by ',';

DWD Layer Order Zipper Table

DROP TABLE test.dwd_orders;
CREATE TABLE test.dwd_orders(
  orderid INT,
  createtime STRING,
  modifiedtime STRING,
  status STRING,
  start_date STRING,
  end_date STRING
)
row format delimited fields terminated by ',';

Full Initialization

INSERT overwrite TABLE test.dwd_orders
SELECT orderid, createtime, modifiedtime, status,
createtime AS start_date,
'9999-12-31' AS end_date
FROM test.ods_orders
WHERE dt='2020-06-20';

Incremental Refresh

INSERT OVERWRITE TABLE test.dwd_orders
SELECT
    orderid,
    createtime,
    modifiedtime,
    status,
    modifiedtime AS start_date,
    '9999-12-31' AS end_date
FROM test.ods_orders
WHERE dt = '2020-06-21'

UNION ALL

SELECT
    A.orderid,
    A.createtime,
    A.modifiedtime,
    A.status,
    A.start_date,
    CASE
        WHEN B.orderid IS NOT NULL AND A.end_date > '2020-06-21' THEN '2020-06-20'
        ELSE A.end_date
    END AS end_date
FROM dwd_orders A
LEFT JOIN
    (SELECT * FROM ods_orders WHERE dt = '2020-06-21') B
ON A.orderid = B.orderid;

Error Quick Reference

SymptomRoot CauseFix
Multiple “9999-12-31” unclosed records for same orderDuring incremental refresh, only new data inserted, old chain not properly closedCheck if orderid group has more than 1 record with end_date=‘9999-12-31’
Historical interval overlapInconsistent start_date/end_date calculation, wrong chain closing dateCheck if date intervals cross by sorting by orderid
Query snapshot for specific day returns incorrect resultsWrong snapshot filter conditionCheck if using dt to query DWD, or not filtering by date range
New orders normal, modified orders lose historyOriginal record directly overwritten, old version not preservedCompare record count for same orderid before and after refresh
Data doubles after daily batch runUNION ALL front/back set boundaries unclear, historical table fully repeatedCheck if incremental day executed repeatedly, check if idempotence control missing