Big Data 246 - Offline Data Warehouse - Hive Zipper Table Practice: Initialization, Incremental Update, Rollback Script
Zipper Table Implementation
userinfo (partitioned table) => userid, mobile, regdate => Daily changed data (modified + new) / Historical data (first day) userhis (zipper table) => Two additional fields: start_date / end_date
Zipper Table
A zipper table is a database design pattern used to track data changes over time while maintaining efficient query performance. This pattern is widely used in data warehouse and data analysis scenarios because it can well record historical data changes.
Basic Concepts of Zipper Table
The core idea of a zipper table is to store the effective time range of each record, recording version changes through a “zipper” method. Each record contains the following key information:
- Start Date/Effective Date: Represents the effective time of this record.
- End Date/Expiration Date: Represents the expiration time of this record.
- Is Current: Represents whether this record is the latest version (usually stored through a flag, e.g., 1 for current record, 0 for historical record).
Working Principle
- New Data: When new data is inserted, the system creates a new record, sets its start time to the current time, end time to a default maximum time (e.g., 9999-12-31), and sets is_current field to 1.
- Update Data: First, update the end time of the existing valid record to the current time, indicating its validity period has ended, and set is_current flag to 0. Then insert a new record representing the updated version, with start time as current time, end time as default maximum time, and is_current flag as 1.
- Delete Data: Generally implemented through logical deletion (updating end time and is_current field) rather than direct physical deletion.
Prepare Data
-- 1. Initialize userinfo (2020-06-20). Get historical data
001,13551111111,2020-03-01,2020-06-20
002,13561111111,2020-04-01,2020-06-20
003,13571111111,2020-05-01,2020-06-20
004,13581111111,2020-06-01,2020-06-20
Initialize Zipper Table
Write data from 2020-06-20 to the table
-- 2. Initialize zipper table (2020-06-20). userinfo => userhis
INSERT OVERWRITE TABLE test.userhis
SELECT
userid,
mobile,
regdate,
dt AS start_date,
'9999-12-31' AS end_date
FROM
test.userinfo
WHERE
dt = '2020-06-20';
Continue Preparing Data
-- 3. Next day new data (2020-06-21); Get new data
002,13562222222,2020-04-01,2020-06-21
004,13582222222,2020-06-01,2020-06-21
005,13552222222,2020-06-21,2020-06-21
Build Zipper Table
-- 4. Build zipper table (userhis) (2020-06-21) 【Core】 userinfo(2020-06-21) + userhis => userhis
-- Step 1: Process new data 【userinfo】(Similar logic to loading historical data)
SELECT
userid,
mobile,
regdate,
dt AS start_date,
'9999-12-31' AS end_date
FROM
test.userinfo
WHERE
dt = '2020-06-21';
-- Step 2: Process historical data 【userhis】(Historical includes two parts: changed, unchanged)
-- Changed: start_date: unchanged; end_date: date - 1
-- Unchanged: no processing
-- Observe data
SELECT
A.userid,
B.userid,
B.mobile,
B.regdate,
B.start_date,
B.end_date
FROM
(SELECT *
FROM test.userinfo
WHERE dt = '2020-06-21') A
RIGHT JOIN
test.userhis B
ON
A.userid = B.userid;
-- Write SQL to process historical data
SELECT
B.userid,
B.mobile,
B.regdate,
B.start_Date,
CASE
WHEN B.end_date = '9999-12-31' AND A.userid IS NOT NULL
THEN DATE_ADD('2020-06-21', INTERVAL -1 DAY)
ELSE B.end_date
END AS end_date
FROM
(SELECT * FROM test.userinfo WHERE dt = '2020-06-21') A
RIGHT JOIN
test.userhis B
ON
A.userid = B.userid;
-- Final processing (new data + historical data)
INSERT OVERWRITE TABLE test.userhis
SELECT
userid,
mobile,
regdate,
dt AS start_date,
'9999-12-31' AS end_date
FROM
test.userinfo
WHERE
dt = '2020-06-21'
UNION ALL
SELECT
B.userid,
B.mobile,
B.regdate,
B.start_date,
CASE
WHEN B.end_date = '9999-12-31' AND A.userid IS NOT NULL
THEN date_add('2020-06-21', -1)
ELSE B.end_date
END AS end_date
FROM
(SELECT * FROM test.userinfo WHERE dt = '2020-06-21') A
RIGHT JOIN
test.userhis B
ON
A.userid = B.userid;
Zipper Table Test Script
vim test_zipper.sh
#!/bin/bash
# Load environment variables
source /etc/profile
# Check if date parameter is passed, if not use previous day's date
if [ -n "$1" ]; then
do_date=$1
else
do_date=$(date -d "-1 day" +%F)
fi
# SQL statement
sql="
INSERT OVERWRITE TABLE test.userhis
SELECT
userid,
mobile,
regdate,
dt AS start_date,
'9999-12-31' AS end_date
FROM
test.userinfo
WHERE
dt = '$do_date'
UNION ALL
SELECT
B.userid,
B.mobile,
B.regdate,
B.start_date,
CASE
WHEN B.end_date = '9999-12-31' AND A.userid IS NOT NULL THEN date_add('$do_date', -1)
ELSE B.end_date
END AS end_date
FROM
(SELECT * FROM test.userinfo WHERE dt = '$do_date') A
RIGHT JOIN
test.userhis B
ON
A.userid = B.userid;
"
# Execute Hive SQL
hive -e "$sql"
Zipper Table Rollback
Due to various reasons, if the zipper table needs to be restored to the data of rollback_date, then:
- end_date < rollback_date: That is, end date < rollback date, means this row of data was generated before roll_back_date, these data need to be kept as is
- start_date <= rollback_date <= end_date: That is, start date <= rollback date <= end date, these data were generated after the rollback date, but need modification, change end_date to 9999-12-31
- Other data does not need processing
Code according to the above scheme:
Process data where end_date < rollback_date
SELECT
userid,
mobile,
regdate,
start_date,
end_date,
'1' AS tag
FROM
test.userhis
WHERE
end_date < '2020-06-22';
Process data where start_date <= rollback_date <= end_date, set end_date=9999-12-31
SELECT
userid,
mobile,
regdate,
start_date,
'9999-12-31' AS end_date,
'2' AS tag
FROM
test.userhis
WHERE
start_date <= '2020-06-22'
AND end_date >= '2020-06-22';
Write the data from the previous two steps to temporary table tmp (zipper table)
-- Drop temporary table
DROP TABLE IF EXISTS test.tmp;
-- Create temporary table
CREATE TABLE test.tmp AS
SELECT
userid,
mobile,
regdate,
start_date,
end_date,
'1' AS tag
FROM
test.userhis
WHERE
end_date < '2020-06-22'
UNION ALL
SELECT
userid,
mobile,
regdate,
start_date,
'9999-12-31' AS end_date,
'2' AS tag
FROM
test.userhis
WHERE
start_date <= '2020-06-22'
AND end_date >= '2020-06-22';
-- Query result and cluster by userid and start_date
SELECT *
FROM test.tmp
CLUSTER BY userid, start_date;
Simulation script:
zippertmp.sh
#!/bin/bash
# Load environment variables
source /etc/profile
# Check if date parameter is passed, if not use previous day's date
if [ -n "$1" ]; then
do_date=$1
else
do_date=$(date -d "-1 day" +%F)
fi
# Define SQL query statement
sql="
DROP TABLE IF EXISTS test.tmp;
CREATE TABLE test.tmp AS
SELECT userid, mobile, regdate, start_date, end_date, '1' AS tag
FROM test.userhis
WHERE end_date < '${do_date}'
UNION ALL
SELECT userid, mobile, regdate, start_date, '9999-12-31' AS end_date, '2' AS tag
FROM test.userhis
WHERE start_date <= '${do_date}'
AND end_date >= '${do_date}';
"
# Execute Hive query
hive -e "$sql"
Error Quick Reference
| Symptom | Root Cause | Location | Fix |
|---|---|---|---|
| Same userid appears multiple times with 9999-12-31 in zipper table | Script executed multiple times on same day, or insufficient deduplication of incremental data | Check number of current valid records after grouping by userid | Deduplicate userinfo first; verify target date not already processed before execution |
| Historical version end_date not closed | Join condition only by userid, but incremental partition data missing or script date error | Check if dt=‘$do_date’ partition exists, check A.userid IS NOT NULL hit rate | Fill missing partition data; add empty partition protection; do data pre-check before execution |
| SQL reports date function error in Hive | Mixed MySQL and Hive date function syntax | Execute date_add related SQL fragments separately for verification | Use consistent date_add(‘2020-06-21’, -1) style in Hive |
| Table data abnormal after INSERT OVERWRITE | Target table rewritten entirely, but source data range incomplete | Compare total count, current valid count, historical count before and after rewrite | Add backup or temp table; write to temp result for verification first, then overwrite formal table |
| Rollback result incorrect | Rollback logic only generates tmp, not fully rewritten, or condition boundary wrong | Check if start_date/end_date in tmp matches rollback day | Clarify final step: use tmp to overwrite userhis; verify <=, >= boundaries |
| New user history broken | Initial load and next day load field口径 inconsistent | Compare userinfo and userhis field order, type, format | Keep field order consistent, date field unified format |
| Shell script execution fails | Hive environment variables not loaded, or executing user lacks permission | Manually execute source /etc/profile && hive -e “show databases” | 固化 Hive environment; print environment and execution date at script start |
| No data for the day but execution succeeds with incorrect result | Empty partition participates in UNION ALL, causing only history kept or wrong chain closing | First check select count(1) from userinfo where dt=‘$do_date’ | Exit directly if no data; add pre-check |
| Query snapshot for a specific day inaccurate | Query condition written wrong when using zipper table, not filtering by valid interval | Check if querying by start_date <= day and end_date >= day | 固化 snapshot query template, avoid directly querying current valid records |
| Task becomes very slow after data volume increases | Full table INSERT OVERWRITE + JOIN cost high, table not partitioned/bucketed | Check execution plan, shuffle, table size | Optimize zipper table by business dimension or date; introduce intermediate layer and partition strategy when necessary |