TL;DR
- Scenario: Offline data warehouse calculates 1/2/3 day member retention count and retention rate based on new member and startup logs
- Conclusion: DWS first creates retention details for “new member date x retention days”, then ADS aggregates by new member date and calculates retention rate by filling in the denominator
- Output: Hive DWS/ADS table structures, loading scripts, calculation logic, and key connection fix points
Retention Member Concept
- Retention members and retention rate: New members added during a certain period, after some time, who continue to use the app are considered retained members. The proportion of these members to the original new members is the retention rate.
- Requirements: 1-day, 2-day, 3-day member retention rates and member retention counts
DWS Layer Function
- Unified Data Model: Integrates, cleans, and processes raw data (ODS layer) according to certain logical models to form standardized data structures
- Support Business Scenarios: Meets enterprise requirements for historical data query and analysis, supports OLAP operations
- Data Refinement and Classification: Classifies data by subject domain for easier management and query
- Data Accuracy and Consistency: Processed data is validated to ensure logical relationships are correct
Create DWS Layer Table
-- Member Retention Detail
drop table if exists dws.dws_member_retention_day;
create table dws.dws_member_retention_day
(
`device_id` string,
`uid` string,
`app_v` string,
`os_type` string,
`language` string,
`channel` string,
`area` string,
`brand` string,
`add_date` string comment 'Member addition time',
`retention_date` int comment 'Retention days'
)COMMENT 'Daily Member Retention Detail'
PARTITIONED BY (`dt` string)
stored as parquet;
Load DWS Layer Data Script
vim /opt/wzk/hive/dws_load_member_retention_day.sh
Write content:
#!/bin/bash
source /etc/profile
if [ -n "$1" ] ;then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
sql="
drop table if exists tmp.tmp_member_retention;
create table tmp.tmp_member_retention as
(
select t2.device_id,
t2.uid,
t2.app_v,
t2.os_type,
t2.language,
t2.channel,
t2.area,
t2.brand,
t2.dt add_date,
1
from dws.dws_member_start_day t1 join dws.dws_member_add_day
t2 on t1.device_id=t2.device_id
where t2.dt=date_add('$do_date', -1)
and t1.dt='$do_date'
union all
select t2.device_id,
t2.uid,
t2.app_v,
t2.os_type,
t2.language,
t2.channel,
t2.area,
t2.brand,
t2.dt add_date,
2
from dws.dws_member_start_day t1 join dws.dws_member_add_day
t2 on t1.device_id=t2.device_id
where t2.dt=date_add('$do_date', -2)
and t1.dt='$do_date'
union all
select t2.device_id,
t2.uid,
t2.app_v,
t2.os_type,
t2.language,
t2.channel,
t2.area,
t2.brand,
t2.dt add_date,
3
from dws.dws_member_start_day t1 join dws.dws_member_add_day
t2 on t1.device_id=t2.device_id
where t2.dt=date_add('$do_date', -3)
and t1.dt='$do_date'
);
insert overwrite table dws.dws_member_retention_day
partition(dt='$do_date')
select * from tmp.tmp_member_retention;
"
hive -e "$sql"
ADS Layer Function
- Aggregation and Simplification: Aggregates data from multiple tables and subject domains in the DWS layer into simple, easy-to-use tables or views
- Business Application Oriented: Through designing wide tables or high-performance views, directly supports specific business scenarios and report requirements
- Data Distribution and Integration: Provides efficient query interfaces for front-end BI tools, reporting systems, or API services
- Lightweight and High Performance: Minimizes data volume while retaining key metrics most important to the business
Create ADS Layer Table
-- Member Retention Count
drop table if exists ads.ads_member_retention_count;
create table ads.ads_member_retention_count
(
`add_date` string comment 'Addition date',
`retention_day` int comment 'Retention days up to current date',
`retention_count` bigint comment 'Retention count'
) COMMENT 'Member Retention Count'
partitioned by(dt string)
row format delimited fields terminated by ',';
-- Member Retention Rate
drop table if exists ads.ads_member_retention_rate;
create table ads.ads_member_retention_rate
(
`add_date` string comment 'Addition date',
`retention_day` int comment 'Retention days up to current date',
`retention_count` bigint comment 'Retention count',
`new_mid_count` bigint comment 'New member count for that day',
`retention_ratio` decimal(10,2) comment 'Retention rate'
) COMMENT 'Member Retention Rate'
partitioned by(dt string)
row format delimited fields terminated by ',';
Load ADS Layer Data Script
vim /opt/wzk/hive/ads_load_member_retention.sh
Write content:
#!/bin/bash
source /etc/profile
if [ -n "$1" ] ;then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
sql="
insert overwrite table ads.ads_member_retention_count
partition (dt='$do_date')
select add_date, retention_date,
count(*) retention_count
from dws.dws_member_retention_day
where dt='$do_date'
group by add_date, retention_date;
insert overwrite table ads.ads_member_retention_rate
partition (dt='$do_date')
select t1.add_date,
t1.retention_day,
t1.retention_count,
t2.cnt,
t1.retention_count/t2.cnt*100
from ads.ads_member_retention_count t1 join
ads.ads_new_member_cnt t2 on t1.dt=t2.dt
where t1.dt='$do_date';
"
hive -e "$sql"
Important Note: The join condition in the last SQL should be: t1.add_date=t2.dt
Relationship Between DWS and ADS
- DWS is the upstream of ADS: The standardized, reusable data services provided by DWS are the foundation for ADS building
- Different functional positioning: DWS focuses more on data integration and standardization, while ADS focuses more on personalized business requirement satisfaction
- Data granularity: DWS has finer granularity; ADS is typically based on aggregation and filtering from DWS, outputting for specific requirements
Error Troubleshooting
| Symptom | Root Cause | Diagnosis | Fix |
|---|---|---|---|
| ADS retention rate cannot be calculated/abnormally large or small | Retention rate SQL join condition is wrong, using t1.dt=t2.dt instead of new member date alignment | Compare add_date in ads_member_retention_count with dt in ads_new_member_cnt | Change join to t1.add_date = t2.dt |
| DWS retention detail data abnormally inflated | dws_member_add_day and start_day may have multiple records for same device_id, causing join multiplication | Check if device_id is unique in dt partition of both detail tables | Add deduplication/primary key constraint before join |
| Retention day statistics missing (e.g., only calculating to day 3) | Script hardcoded union all 1/2/3 | Check union sections in dws_load_member_retention_day.sh | Abstract to N-day retention: use dimension table/number table to generate retention_day |
| Retention caliber deviation (cross-device, cross-account) | Using device_id as primary key, member dimension should use uid | Compare with business definition | Clarify caliber: if using uid retention, change join key to uid |
| Retention rate decimal precision/percentage display incorrect | retention_count/t2.cnt*100 may trigger integer division or insufficient precision | Check Hive expression type inference and final retention_ratio | Use cast: cast(t1.retention_count as decimal(18,4))/cast(t2.cnt as decimal(18,4))*100 |
| Partition overwrite/data loss | insert overwrite overwrites partition; script defaults to do_date=-1, accidental run will overwrite history | Check insert overwrite + dt=‘$do_date’ | Keep overwrite but add batch protection: parameter required or write protection |
| Temp table repeated creation affecting performance | Daily drop/create tmp table, union all scans start_day/add_day three times | Check explain/execution logs for scan count and shuffle | Change to single scan: get do_date startup set first, then calculate retention_day with add_day using case when |
| Mixed Parquet and row format tables causing inconsistent performance | DWS uses parquet, ADS uses textfile | Check ADS query performance, compression rate, file count | ADS also uses parquet/orc |