Big Data 246 - Offline Data Warehouse - Hive Zipper Table Practice: Initialization, Incremental Update, Rollback Script

Zipper Table Implementation

userinfo (partitioned table) => userid, mobile, regdate => Daily changed data (modified + new) / Historical data (first day) userhis (zipper table) => Two additional fields: start_date / end_date

Zipper Table

A zipper table is a database design pattern used to track data changes over time while maintaining efficient query performance. This pattern is widely used in data warehouse and data analysis scenarios because it can well record historical data changes.

Basic Concepts of Zipper Table

The core idea of a zipper table is to store the effective time range of each record, recording version changes through a “zipper” method. Each record contains the following key information:

  • Start Date/Effective Date: Represents the effective time of this record.
  • End Date/Expiration Date: Represents the expiration time of this record.
  • Is Current: Represents whether this record is the latest version (usually stored through a flag, e.g., 1 for current record, 0 for historical record).

Working Principle

  • New Data: When new data is inserted, the system creates a new record, sets its start time to the current time, end time to a default maximum time (e.g., 9999-12-31), and sets is_current field to 1.
  • Update Data: First, update the end time of the existing valid record to the current time, indicating its validity period has ended, and set is_current flag to 0. Then insert a new record representing the updated version, with start time as current time, end time as default maximum time, and is_current flag as 1.
  • Delete Data: Generally implemented through logical deletion (updating end time and is_current field) rather than direct physical deletion.

Prepare Data

-- 1. Initialize userinfo (2020-06-20). Get historical data
001,13551111111,2020-03-01,2020-06-20
002,13561111111,2020-04-01,2020-06-20
003,13571111111,2020-05-01,2020-06-20
004,13581111111,2020-06-01,2020-06-20

Initialize Zipper Table

Write data from 2020-06-20 to the table

-- 2. Initialize zipper table (2020-06-20). userinfo => userhis
INSERT OVERWRITE TABLE test.userhis
SELECT
    userid,
    mobile,
    regdate,
    dt AS start_date,
    '9999-12-31' AS end_date
FROM
    test.userinfo
WHERE
    dt = '2020-06-20';

Continue Preparing Data

-- 3. Next day new data (2020-06-21); Get new data
002,13562222222,2020-04-01,2020-06-21
004,13582222222,2020-06-01,2020-06-21
005,13552222222,2020-06-21,2020-06-21

Build Zipper Table

-- 4. Build zipper table (userhis) (2020-06-21) 【Core】 userinfo(2020-06-21) + userhis => userhis

-- Step 1: Process new data 【userinfo】(Similar logic to loading historical data)
SELECT
    userid,
    mobile,
    regdate,
    dt AS start_date,
    '9999-12-31' AS end_date
FROM
    test.userinfo
WHERE
    dt = '2020-06-21';

-- Step 2: Process historical data 【userhis】(Historical includes two parts: changed, unchanged)
-- Changed: start_date: unchanged; end_date: date - 1
-- Unchanged: no processing

-- Observe data
SELECT
    A.userid,
    B.userid,
    B.mobile,
    B.regdate,
    B.start_date,
    B.end_date
FROM
    (SELECT *
     FROM test.userinfo
     WHERE dt = '2020-06-21') A
RIGHT JOIN
    test.userhis B
ON
    A.userid = B.userid;

-- Write SQL to process historical data
SELECT
    B.userid,
    B.mobile,
    B.regdate,
    B.start_Date,
    CASE
        WHEN B.end_date = '9999-12-31' AND A.userid IS NOT NULL
        THEN DATE_ADD('2020-06-21', INTERVAL -1 DAY)
        ELSE B.end_date
    END AS end_date
FROM
    (SELECT * FROM test.userinfo WHERE dt = '2020-06-21') A
RIGHT JOIN
    test.userhis B
ON
    A.userid = B.userid;

-- Final processing (new data + historical data)
INSERT OVERWRITE TABLE test.userhis
SELECT
    userid,
    mobile,
    regdate,
    dt AS start_date,
    '9999-12-31' AS end_date
FROM
    test.userinfo
WHERE
    dt = '2020-06-21'
UNION ALL
SELECT
    B.userid,
    B.mobile,
    B.regdate,
    B.start_date,
    CASE
        WHEN B.end_date = '9999-12-31' AND A.userid IS NOT NULL
        THEN date_add('2020-06-21', -1)
        ELSE B.end_date
    END AS end_date
FROM
    (SELECT * FROM test.userinfo WHERE dt = '2020-06-21') A
RIGHT JOIN
    test.userhis B
ON
    A.userid = B.userid;

Zipper Table Test Script

vim test_zipper.sh

#!/bin/bash
# Load environment variables
source /etc/profile

# Check if date parameter is passed, if not use previous day's date
if [ -n "$1" ]; then
    do_date=$1
else
    do_date=$(date -d "-1 day" +%F)
fi

# SQL statement
sql="
INSERT OVERWRITE TABLE test.userhis
SELECT
    userid,
    mobile,
    regdate,
    dt AS start_date,
    '9999-12-31' AS end_date
FROM
    test.userinfo
WHERE
    dt = '$do_date'
UNION ALL
SELECT
    B.userid,
    B.mobile,
    B.regdate,
    B.start_date,
    CASE
        WHEN B.end_date = '9999-12-31' AND A.userid IS NOT NULL THEN date_add('$do_date', -1)
        ELSE B.end_date
    END AS end_date
FROM
    (SELECT * FROM test.userinfo WHERE dt = '$do_date') A
RIGHT JOIN
    test.userhis B
ON
    A.userid = B.userid;
"

# Execute Hive SQL
hive -e "$sql"

Zipper Table Rollback

Due to various reasons, if the zipper table needs to be restored to the data of rollback_date, then:

  • end_date < rollback_date: That is, end date < rollback date, means this row of data was generated before roll_back_date, these data need to be kept as is
  • start_date <= rollback_date <= end_date: That is, start date <= rollback date <= end date, these data were generated after the rollback date, but need modification, change end_date to 9999-12-31
  • Other data does not need processing

Code according to the above scheme:

Process data where end_date < rollback_date

SELECT
    userid,
    mobile,
    regdate,
    start_date,
    end_date,
    '1' AS tag
FROM
    test.userhis
WHERE
    end_date < '2020-06-22';

Process data where start_date <= rollback_date <= end_date, set end_date=9999-12-31

SELECT
    userid,
    mobile,
    regdate,
    start_date,
    '9999-12-31' AS end_date,
    '2' AS tag
FROM
    test.userhis
WHERE
    start_date <= '2020-06-22'
    AND end_date >= '2020-06-22';

Write the data from the previous two steps to temporary table tmp (zipper table)

-- Drop temporary table
DROP TABLE IF EXISTS test.tmp;

-- Create temporary table
CREATE TABLE test.tmp AS
SELECT
    userid,
    mobile,
    regdate,
    start_date,
    end_date,
    '1' AS tag
FROM
    test.userhis
WHERE
    end_date < '2020-06-22'

UNION ALL

SELECT
    userid,
    mobile,
    regdate,
    start_date,
    '9999-12-31' AS end_date,
    '2' AS tag
FROM
    test.userhis
WHERE
    start_date <= '2020-06-22'
    AND end_date >= '2020-06-22';

-- Query result and cluster by userid and start_date
SELECT *
FROM test.tmp
CLUSTER BY userid, start_date;

Simulation script:

zippertmp.sh

#!/bin/bash

# Load environment variables
source /etc/profile

# Check if date parameter is passed, if not use previous day's date
if [ -n "$1" ]; then
  do_date=$1
else
  do_date=$(date -d "-1 day" +%F)
fi

# Define SQL query statement
sql="
  DROP TABLE IF EXISTS test.tmp;

  CREATE TABLE test.tmp AS
  SELECT userid, mobile, regdate, start_date, end_date, '1' AS tag
  FROM test.userhis
  WHERE end_date < '${do_date}'

  UNION ALL

  SELECT userid, mobile, regdate, start_date, '9999-12-31' AS end_date, '2' AS tag
  FROM test.userhis
  WHERE start_date <= '${do_date}'
  AND end_date >= '${do_date}';
"

# Execute Hive query
hive -e "$sql"

Error Quick Reference

SymptomRoot CauseLocationFix
Same userid appears multiple times with 9999-12-31 in zipper tableScript executed multiple times on same day, or insufficient deduplication of incremental dataCheck number of current valid records after grouping by useridDeduplicate userinfo first; verify target date not already processed before execution
Historical version end_date not closedJoin condition only by userid, but incremental partition data missing or script date errorCheck if dt=‘$do_date’ partition exists, check A.userid IS NOT NULL hit rateFill missing partition data; add empty partition protection; do data pre-check before execution
SQL reports date function error in HiveMixed MySQL and Hive date function syntaxExecute date_add related SQL fragments separately for verificationUse consistent date_add(‘2020-06-21’, -1) style in Hive
Table data abnormal after INSERT OVERWRITETarget table rewritten entirely, but source data range incompleteCompare total count, current valid count, historical count before and after rewriteAdd backup or temp table; write to temp result for verification first, then overwrite formal table
Rollback result incorrectRollback logic only generates tmp, not fully rewritten, or condition boundary wrongCheck if start_date/end_date in tmp matches rollback dayClarify final step: use tmp to overwrite userhis; verify <=, >= boundaries
New user history brokenInitial load and next day load field口径 inconsistentCompare userinfo and userhis field order, type, formatKeep field order consistent, date field unified format
Shell script execution failsHive environment variables not loaded, or executing user lacks permissionManually execute source /etc/profile && hive -e “show databases”固化 Hive environment; print environment and execution date at script start
No data for the day but execution succeeds with incorrect resultEmpty partition participates in UNION ALL, causing only history kept or wrong chain closingFirst check select count(1) from userinfo where dt=‘$do_date’Exit directly if no data; add pre-check
Query snapshot for a specific day inaccurateQuery condition written wrong when using zipper table, not filtering by valid intervalCheck if querying by start_date <= day and end_date >= day固化 snapshot query template, avoid directly querying current valid records
Task becomes very slow after data volume increasesFull table INSERT OVERWRITE + JOIN cost high, table not partitioned/bucketedCheck execution plan, shuffle, table sizeOptimize zipper table by business dimension or date; introduce intermediate layer and partition strategy when necessary