Big Data 247 - Offline Data Warehouse - Hive Zipper Table Practice: Order History State Incremental Refresh
TL;DR
- Scenario: Need to preserve order history states at low cost in offline data warehouse, while supporting daily backtracking and change analysis
- Conclusion: Using ODS daily incremental table + DWD zipper table is more balanced than “keep only latest” or “store full snapshot daily”
- Output: Provides Hive zipper table initialization, incremental refresh SQL, chain closing rules, and common error quick reference
Periodic Fact Table Definition
Periodic fact tables record regularly occurring business events or metric data, such as daily, monthly, or quarterly data. Their design purpose is to help users quickly query and analyze trends and changes in these periodic data.
Zipper Table Implementation
ODS Layer Order Table
DROP TABLE test.ods_orders;
CREATE TABLE test.ods_orders(
orderid INT,
createtime STRING,
modifiedtime STRING,
status STRING
) PARTITIONED BY (dt STRING)
row format delimited fields terminated by ',';
DWD Layer Order Zipper Table
DROP TABLE test.dwd_orders;
CREATE TABLE test.dwd_orders(
orderid INT,
createtime STRING,
modifiedtime STRING,
status STRING,
start_date STRING,
end_date STRING
)
row format delimited fields terminated by ',';
Full Initialization
INSERT overwrite TABLE test.dwd_orders
SELECT orderid, createtime, modifiedtime, status,
createtime AS start_date,
'9999-12-31' AS end_date
FROM test.ods_orders
WHERE dt='2020-06-20';
Incremental Refresh
INSERT OVERWRITE TABLE test.dwd_orders
SELECT
orderid,
createtime,
modifiedtime,
status,
modifiedtime AS start_date,
'9999-12-31' AS end_date
FROM test.ods_orders
WHERE dt = '2020-06-21'
UNION ALL
SELECT
A.orderid,
A.createtime,
A.modifiedtime,
A.status,
A.start_date,
CASE
WHEN B.orderid IS NOT NULL AND A.end_date > '2020-06-21' THEN '2020-06-20'
ELSE A.end_date
END AS end_date
FROM dwd_orders A
LEFT JOIN
(SELECT * FROM ods_orders WHERE dt = '2020-06-21') B
ON A.orderid = B.orderid;
Error Quick Reference
| Symptom | Root Cause | Fix |
|---|---|---|
| Multiple “9999-12-31” unclosed records for same order in zipper table | During incremental refresh, only new data was inserted, old chain not properly closed | Check if orderid group has more than 1 record with end_date=‘9999-12-31’ |
| Historical interval overlap | Inconsistent start_date, end_date calculation, wrong chain closing date | Check if date intervals cross by sorting by orderid |
| Query snapshot for a specific day returns incorrect results | Wrong snapshot filter condition | Check if using dt to query DWD, or not filtering by date range |
| New orders are normal, modified orders lose history | Original record was directly overwritten, old version not preserved | Compare record count for same orderid before and after refresh |
| Data doubles after daily batch run | UNION ALL front and back set boundaries unclear, historical table fully repeated concatenation | Check if incremental day was executed repeatedly, check if idempotence control is missing |