TL;DR

  • Scenario: Offline data warehouse calculates 1/2/3 day member retention count and retention rate based on new member and startup logs
  • Conclusion: DWS first creates retention details for “new member date x retention days”, then ADS aggregates by new member date and calculates retention rate by filling in the denominator
  • Output: Hive DWS/ADS table structures, loading scripts, calculation logic, and key connection fix points

Retention Member Concept

  • Retention members and retention rate: New members added during a certain period, after some time, who continue to use the app are considered retained members. The proportion of these members to the original new members is the retention rate.
  • Requirements: 1-day, 2-day, 3-day member retention rates and member retention counts

DWS Layer Function

  • Unified Data Model: Integrates, cleans, and processes raw data (ODS layer) according to certain logical models to form standardized data structures
  • Support Business Scenarios: Meets enterprise requirements for historical data query and analysis, supports OLAP operations
  • Data Refinement and Classification: Classifies data by subject domain for easier management and query
  • Data Accuracy and Consistency: Processed data is validated to ensure logical relationships are correct

Create DWS Layer Table

-- Member Retention Detail
drop table if exists dws.dws_member_retention_day;
create table dws.dws_member_retention_day
(
  `device_id` string,
  `uid` string,
  `app_v` string,
  `os_type` string,
  `language` string,
  `channel` string,
  `area` string,
  `brand` string,
  `add_date` string comment 'Member addition time',
  `retention_date` int comment 'Retention days'
)COMMENT 'Daily Member Retention Detail'
PARTITIONED BY (`dt` string)
stored as parquet;

Load DWS Layer Data Script

vim /opt/wzk/hive/dws_load_member_retention_day.sh

Write content:

#!/bin/bash
source /etc/profile
if [ -n "$1" ] ;then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
sql="
drop table if exists tmp.tmp_member_retention;
create table tmp.tmp_member_retention as
(
  select t2.device_id,
  t2.uid,
  t2.app_v,
  t2.os_type,
  t2.language,
  t2.channel,
  t2.area,
  t2.brand,
  t2.dt add_date,
  1
  from dws.dws_member_start_day t1 join dws.dws_member_add_day
  t2 on t1.device_id=t2.device_id
  where t2.dt=date_add('$do_date', -1)
  and t1.dt='$do_date'
  union all
  select t2.device_id,
  t2.uid,
  t2.app_v,
  t2.os_type,
  t2.language,
  t2.channel,
  t2.area,
  t2.brand,
  t2.dt add_date,
  2
  from dws.dws_member_start_day t1 join dws.dws_member_add_day
  t2 on t1.device_id=t2.device_id
  where t2.dt=date_add('$do_date', -2)
  and t1.dt='$do_date'
  union all
  select t2.device_id,
  t2.uid,
  t2.app_v,
  t2.os_type,
  t2.language,
  t2.channel,
  t2.area,
  t2.brand,
  t2.dt add_date,
  3
  from dws.dws_member_start_day t1 join dws.dws_member_add_day
  t2 on t1.device_id=t2.device_id
  where t2.dt=date_add('$do_date', -3)
  and t1.dt='$do_date'
);
insert overwrite table dws.dws_member_retention_day
partition(dt='$do_date')
select * from tmp.tmp_member_retention;
"
hive -e "$sql"

ADS Layer Function

  • Aggregation and Simplification: Aggregates data from multiple tables and subject domains in the DWS layer into simple, easy-to-use tables or views
  • Business Application Oriented: Through designing wide tables or high-performance views, directly supports specific business scenarios and report requirements
  • Data Distribution and Integration: Provides efficient query interfaces for front-end BI tools, reporting systems, or API services
  • Lightweight and High Performance: Minimizes data volume while retaining key metrics most important to the business

Create ADS Layer Table

-- Member Retention Count
drop table if exists ads.ads_member_retention_count;
create table ads.ads_member_retention_count
(
  `add_date` string comment 'Addition date',
  `retention_day` int comment 'Retention days up to current date',
  `retention_count` bigint comment 'Retention count'
) COMMENT 'Member Retention Count'
partitioned by(dt string)
row format delimited fields terminated by ',';

-- Member Retention Rate
drop table if exists ads.ads_member_retention_rate;
create table ads.ads_member_retention_rate
(
  `add_date` string comment 'Addition date',
  `retention_day` int comment 'Retention days up to current date',
  `retention_count` bigint comment 'Retention count',
  `new_mid_count` bigint comment 'New member count for that day',
  `retention_ratio` decimal(10,2) comment 'Retention rate'
) COMMENT 'Member Retention Rate'
partitioned by(dt string)
row format delimited fields terminated by ',';

Load ADS Layer Data Script

vim /opt/wzk/hive/ads_load_member_retention.sh

Write content:

#!/bin/bash
source /etc/profile
if [ -n "$1" ] ;then
  do_date=$1
else
  do_date=`date -d "-1 day" +%F`
  fi
  sql="
insert overwrite table ads.ads_member_retention_count
partition (dt='$do_date')
select add_date, retention_date,
count(*) retention_count
from dws.dws_member_retention_day
where dt='$do_date'
group by add_date, retention_date;
insert overwrite table ads.ads_member_retention_rate
partition (dt='$do_date')
select t1.add_date,
t1.retention_day,
t1.retention_count,
t2.cnt,
t1.retention_count/t2.cnt*100
from ads.ads_member_retention_count t1 join
ads.ads_new_member_cnt t2 on t1.dt=t2.dt
where t1.dt='$do_date';
"
hive -e "$sql"

Important Note: The join condition in the last SQL should be: t1.add_date=t2.dt

Relationship Between DWS and ADS

  • DWS is the upstream of ADS: The standardized, reusable data services provided by DWS are the foundation for ADS building
  • Different functional positioning: DWS focuses more on data integration and standardization, while ADS focuses more on personalized business requirement satisfaction
  • Data granularity: DWS has finer granularity; ADS is typically based on aggregation and filtering from DWS, outputting for specific requirements

Error Troubleshooting

SymptomRoot CauseDiagnosisFix
ADS retention rate cannot be calculated/abnormally large or smallRetention rate SQL join condition is wrong, using t1.dt=t2.dt instead of new member date alignmentCompare add_date in ads_member_retention_count with dt in ads_new_member_cntChange join to t1.add_date = t2.dt
DWS retention detail data abnormally inflateddws_member_add_day and start_day may have multiple records for same device_id, causing join multiplicationCheck if device_id is unique in dt partition of both detail tablesAdd deduplication/primary key constraint before join
Retention day statistics missing (e.g., only calculating to day 3)Script hardcoded union all 1/2/3Check union sections in dws_load_member_retention_day.shAbstract to N-day retention: use dimension table/number table to generate retention_day
Retention caliber deviation (cross-device, cross-account)Using device_id as primary key, member dimension should use uidCompare with business definitionClarify caliber: if using uid retention, change join key to uid
Retention rate decimal precision/percentage display incorrectretention_count/t2.cnt*100 may trigger integer division or insufficient precisionCheck Hive expression type inference and final retention_ratioUse cast: cast(t1.retention_count as decimal(18,4))/cast(t2.cnt as decimal(18,4))*100
Partition overwrite/data lossinsert overwrite overwrites partition; script defaults to do_date=-1, accidental run will overwrite historyCheck insert overwrite + dt=‘$do_date’Keep overwrite but add batch protection: parameter required or write protection
Temp table repeated creation affecting performanceDaily drop/create tmp table, union all scans start_day/add_day three timesCheck explain/execution logs for scan count and shuffleChange to single scan: get do_date startup set first, then calculate retention_day with add_day using case when
Mixed Parquet and row format tables causing inconsistent performanceDWS uses parquet, ADS uses textfileCheck ADS query performance, compression rate, file countADS also uses parquet/orc