Offline Data Warehouse Retention Rate Implementation: DWS...

TL;DR

Scenario: Offline data warehouse calculates 1/2/3 day member retention count and retention rate based on new member and startup logs
Conclusion: DWS first creates retention details for “new member date x retention days”, then ADS aggregates by new member date and calculates retention rate by filling in the denominator
Output: Hive DWS/ADS table structures, loading scripts, calculation logic, and key connection fix points

Retention Member Concept

Retention members and retention rate: New members added during a certain period, after some time, who continue to use the app are considered retained members. The proportion of these members to the original new members is the retention rate.
Requirements: 1-day, 2-day, 3-day member retention rates and member retention counts

DWS Layer Function

Unified Data Model: Integrates, cleans, and processes raw data (ODS layer) according to certain logical models to form standardized data structures
Support Business Scenarios: Meets enterprise requirements for historical data query and analysis, supports OLAP operations
Data Refinement and Classification: Classifies data by subject domain for easier management and query
Data Accuracy and Consistency: Processed data is validated to ensure logical relationships are correct

Create DWS Layer Table

-- Member Retention Detail
drop table if exists dws.dws_member_retention_day;
create table dws.dws_member_retention_day
(
  `device_id` string,
  `uid` string,
  `app_v` string,
  `os_type` string,
  `language` string,
  `channel` string,
  `area` string,
  `brand` string,
  `add_date` string comment 'Member addition time',
  `retention_date` int comment 'Retention days'
)COMMENT 'Daily Member Retention Detail'
PARTITIONED BY (`dt` string)
stored as parquet;

Load DWS Layer Data Script

vim /opt/wzk/hive/dws_load_member_retention_day.sh

Write content:

#!/bin/bash
source /etc/profile
if [ -n "$1" ] ;then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
sql="
drop table if exists tmp.tmp_member_retention;
create table tmp.tmp_member_retention as
(
  select t2.device_id,
  t2.uid,
  t2.app_v,
  t2.os_type,
  t2.language,
  t2.channel,
  t2.area,
  t2.brand,
  t2.dt add_date,
  1
  from dws.dws_member_start_day t1 join dws.dws_member_add_day
  t2 on t1.device_id=t2.device_id
  where t2.dt=date_add('$do_date', -1)
  and t1.dt='$do_date'
  union all
  select t2.device_id,
  t2.uid,
  t2.app_v,
  t2.os_type,
  t2.language,
  t2.channel,
  t2.area,
  t2.brand,
  t2.dt add_date,
  2
  from dws.dws_member_start_day t1 join dws.dws_member_add_day
  t2 on t1.device_id=t2.device_id
  where t2.dt=date_add('$do_date', -2)
  and t1.dt='$do_date'
  union all
  select t2.device_id,
  t2.uid,
  t2.app_v,
  t2.os_type,
  t2.language,
  t2.channel,
  t2.area,
  t2.brand,
  t2.dt add_date,
  3
  from dws.dws_member_start_day t1 join dws.dws_member_add_day
  t2 on t1.device_id=t2.device_id
  where t2.dt=date_add('$do_date', -3)
  and t1.dt='$do_date'
);
insert overwrite table dws.dws_member_retention_day
partition(dt='$do_date')
select * from tmp.tmp_member_retention;
"
hive -e "$sql"

ADS Layer Function

Aggregation and Simplification: Aggregates data from multiple tables and subject domains in the DWS layer into simple, easy-to-use tables or views
Business Application Oriented: Through designing wide tables or high-performance views, directly supports specific business scenarios and report requirements
Data Distribution and Integration: Provides efficient query interfaces for front-end BI tools, reporting systems, or API services
Lightweight and High Performance: Minimizes data volume while retaining key metrics most important to the business

Create ADS Layer Table

-- Member Retention Count
drop table if exists ads.ads_member_retention_count;
create table ads.ads_member_retention_count
(
  `add_date` string comment 'Addition date',
  `retention_day` int comment 'Retention days up to current date',
  `retention_count` bigint comment 'Retention count'
) COMMENT 'Member Retention Count'
partitioned by(dt string)
row format delimited fields terminated by ',';

-- Member Retention Rate
drop table if exists ads.ads_member_retention_rate;
create table ads.ads_member_retention_rate
(
  `add_date` string comment 'Addition date',
  `retention_day` int comment 'Retention days up to current date',
  `retention_count` bigint comment 'Retention count',
  `new_mid_count` bigint comment 'New member count for that day',
  `retention_ratio` decimal(10,2) comment 'Retention rate'
) COMMENT 'Member Retention Rate'
partitioned by(dt string)
row format delimited fields terminated by ',';

Load ADS Layer Data Script

vim /opt/wzk/hive/ads_load_member_retention.sh

Write content:

#!/bin/bash
source /etc/profile
if [ -n "$1" ] ;then
  do_date=$1
else
  do_date=`date -d "-1 day" +%F`
  fi
  sql="
insert overwrite table ads.ads_member_retention_count
partition (dt='$do_date')
select add_date, retention_date,
count(*) retention_count
from dws.dws_member_retention_day
where dt='$do_date'
group by add_date, retention_date;
insert overwrite table ads.ads_member_retention_rate
partition (dt='$do_date')
select t1.add_date,
t1.retention_day,
t1.retention_count,
t2.cnt,
t1.retention_count/t2.cnt*100
from ads.ads_member_retention_count t1 join
ads.ads_new_member_cnt t2 on t1.dt=t2.dt
where t1.dt='$do_date';
"
hive -e "$sql"

Important Note: The join condition in the last SQL should be: t1.add_date=t2.dt

Relationship Between DWS and ADS

DWS is the upstream of ADS: The standardized, reusable data services provided by DWS are the foundation for ADS building
Different functional positioning: DWS focuses more on data integration and standardization, while ADS focuses more on personalized business requirement satisfaction
Data granularity: DWS has finer granularity; ADS is typically based on aggregation and filtering from DWS, outputting for specific requirements

Error Troubleshooting

Symptom	Root Cause	Diagnosis	Fix
ADS retention rate cannot be calculated/abnormally large or small	Retention rate SQL join condition is wrong, using t1.dt=t2.dt instead of new member date alignment	Compare add_date in ads_member_retention_count with dt in ads_new_member_cnt	Change join to t1.add_date = t2.dt
DWS retention detail data abnormally inflated	dws_member_add_day and start_day may have multiple records for same device_id, causing join multiplication	Check if device_id is unique in dt partition of both detail tables	Add deduplication/primary key constraint before join
Retention day statistics missing (e.g., only calculating to day 3)	Script hardcoded union all 1/2/3	Check union sections in dws_load_member_retention_day.sh	Abstract to N-day retention: use dimension table/number table to generate retention_day
Retention caliber deviation (cross-device, cross-account)	Using device_id as primary key, member dimension should use uid	Compare with business definition	Clarify caliber: if using uid retention, change join key to uid
Retention rate decimal precision/percentage display incorrect	retention_count/t2.cnt*100 may trigger integer division or insufficient precision	Check Hive expression type inference and final retention_ratio	Use cast: cast(t1.retention_count as decimal(18,4))/cast(t2.cnt as decimal(18,4))*100
Partition overwrite/data loss	insert overwrite overwrites partition; script defaults to do_date=-1, accidental run will overwrite history	Check insert overwrite + dt=‘$do_date’	Keep overwrite but add batch protection: parameter required or write protection
Temp table repeated creation affecting performance	Daily drop/create tmp table, union all scans start_day/add_day three times	Check explain/execution logs for scan count and shuffle	Change to single scan: get do_date startup set first, then calculate retention_day with add_day using case when
Mixed Parquet and row format tables causing inconsistent performance	DWS uses parquet, ADS uses textfile	Check ADS query performance, compression rate, file count	ADS also uses parquet/orc