Big Data 247 - Offline Data Warehouse - Hive Zipper Table Practice: Order History State Incremental Refresh

TL;DR

  • Scenario: Need to preserve order history states at low cost in offline data warehouse, while supporting daily backtracking and change analysis
  • Conclusion: Using ODS daily incremental table + DWD zipper table is more balanced than “keep only latest” or “store full snapshot daily”
  • Output: Provides Hive zipper table initialization, incremental refresh SQL, chain closing rules, and common error quick reference

Periodic Fact Table Definition

Periodic fact tables record regularly occurring business events or metric data, such as daily, monthly, or quarterly data. Their design purpose is to help users quickly query and analyze trends and changes in these periodic data.

Zipper Table Implementation

ODS Layer Order Table

DROP TABLE test.ods_orders;
CREATE TABLE test.ods_orders(
  orderid INT,
  createtime STRING,
  modifiedtime STRING,
  status STRING
) PARTITIONED BY (dt STRING)
row format delimited fields terminated by ',';

DWD Layer Order Zipper Table

DROP TABLE test.dwd_orders;
CREATE TABLE test.dwd_orders(
  orderid INT,
  createtime STRING,
  modifiedtime STRING,
  status STRING,
  start_date STRING,
  end_date STRING
)
row format delimited fields terminated by ',';

Full Initialization

INSERT overwrite TABLE test.dwd_orders
SELECT orderid, createtime, modifiedtime, status,
createtime AS start_date,
'9999-12-31' AS end_date
FROM test.ods_orders
WHERE dt='2020-06-20';

Incremental Refresh

INSERT OVERWRITE TABLE test.dwd_orders
SELECT
    orderid,
    createtime,
    modifiedtime,
    status,
    modifiedtime AS start_date,
    '9999-12-31' AS end_date
FROM test.ods_orders
WHERE dt = '2020-06-21'

UNION ALL

SELECT
    A.orderid,
    A.createtime,
    A.modifiedtime,
    A.status,
    A.start_date,
    CASE
        WHEN B.orderid IS NOT NULL AND A.end_date > '2020-06-21' THEN '2020-06-20'
        ELSE A.end_date
    END AS end_date
FROM dwd_orders A
LEFT JOIN
    (SELECT * FROM ods_orders WHERE dt = '2020-06-21') B
ON A.orderid = B.orderid;

Error Quick Reference

SymptomRoot CauseFix
Multiple “9999-12-31” unclosed records for same order in zipper tableDuring incremental refresh, only new data was inserted, old chain not properly closedCheck if orderid group has more than 1 record with end_date=‘9999-12-31’
Historical interval overlapInconsistent start_date, end_date calculation, wrong chain closing dateCheck if date intervals cross by sorting by orderid
Query snapshot for a specific day returns incorrect resultsWrong snapshot filter conditionCheck if using dt to query DWD, or not filtering by date range
New orders are normal, modified orders lose historyOriginal record was directly overwritten, old version not preservedCompare record count for same orderid before and after refresh
Data doubles after daily batch runUNION ALL front and back set boundaries unclear, historical table fully repeated concatenationCheck if incremental day was executed repeatedly, check if idempotence control is missing