Big Data 232 - Hive New Member & Retention
TL;DR
- Scenario: Offline data warehouse calculates “new members” daily, and provides consistent data foundation for subsequent “member retention”
- Conclusion: Use “full member table (with first day dt)” as deduplication anchor, DWS outputs new member details, ADS outputs new member count, scripts run serially
- Output: Reusable Hive SQL/script templates: new member identification (left join is null) + full member table incremental maintenance + daily new member metric table
New Members
- Retained members: New members from a certain period, after some time, still using the app are considered retained members
- New members: Users who use the app for the first time are defined as new members. Devices that uninstall and reinstall will NOT be counted as new users
New members must be calculated first => then calculate member retention
Requirements Description
Daily new member count
-
08-02: DWD, member daily startup details (95-110); all member information (1-100)
- New members 101-110
- New member data + old member information = new all member information (1-110)
-
08-03: DWD, member daily startup details (100-120); all member information (1-110)
- New members: 111-120
- New member data + old all member information = new all member information (1-120)
Calculation steps:
- Calculate new members
- Update all member information
Improved Method
- Add a time column in all member information, indicating which day this member became a new member
- Only need one table: all member information (id, dt)
- Insert new members into the all member table
Case: How to Calculate New Members
-- t1.dat data is as follows
4,2020-08-02
5,2020-08-02
6,2020-08-02
7,2020-08-02
8,2020-08-02
9,2020-08-02
-- Daily startup table => DWS
drop table t1;
create table t1(id int, dt string) row format delimited fields terminated by ',';
load data local inpath '/opt/wzk/hive/data/t1.dat' into table t1;
-- t2.dat data is as follows
1,2020-08-01
2,2020-08-01
3,2020-08-01
4,2020-08-01
5,2020-08-01
6,2020-08-01
-- Full data => DWS
drop table t2;
create table t2(id int, dt string)
row format delimited fields terminated by ',';
load data local inpath '/opt/wzk/hive/data/t2.dat' into table t2;
-- Find new users on 2020-08-02
select t1.id, t1.dt, t2.id, t2.dt
from t1 left join t2 on t1.id=t2.id
where t1.dt="2020-08-02";
select t1.id, t1.dt
from t1 left join t2 on t1.id=t2.id
where t1.dt="2020-08-02"
and t2.id is null;
-- Insert found new users into t2 table
insert into table t2
select t1.id, t1.dt
from t1 left join t2 on t1.id=t2.id
where t1.dt="2020-08-02"
and t2.id is null;
t2 data result:
7 2020-08-02
8 2020-08-02
9 2020-08-02
1 2020-08-01
2 2020-08-01
3 2020-08-01
4 2020-08-01
5 2020-08-01
6 2020-08-01
Continue loading new data:
-- t3.dat data, t1 loads 2020-08-03 data
14,2020-08-03
15,2020-08-03
16,2020-08-03
17,2020-08-03
18,2020-08-03
19,2020-08-03
load data local inpath '/opt/wzk/hive/data/t3.dat' into table t1;
-- Insert found new users into t2 table
insert into table t2
select t1.id, t1.dt
from t1 left join t2 on t1.id=t2.id
where t1.dt="2020-08-03"
and t2.id is null;
-- Check result
select * from t2;
Final t2 data:
7 2020-08-02
8 2020-08-02
9 2020-08-02
14 2020-08-03
15 2020-08-03
16 2020-08-03
17 2020-08-03
18 2020-08-03
19 2020-08-03
1 2020-08-01
2 2020-08-01
3 2020-08-01
4 2020-08-01
5 2020-08-01
6 2020-08-01
DWS Layer Role
Unified Data Model
Integrate, clean, and process raw data (ODS layer) according to certain logical models to form standardized data structures. Support multi-dimensional and multi-granularity analysis of data.
Support Business Scenarios
Meet enterprise needs for querying and analyzing historical data. Support OLAP (Online Analytical Processing) operations such as aggregate queries, drilling, and slicing.
Data Refinement and Classification
Classify data by subject areas (such as sales, finance, inventory, etc.) for easier management and querying. Typically maintain high detail granularity for flexible expansion.
Data Accuracy and Consistency
Processed data is validated to ensure logical relationships are correct, providing accurate and consistent data for downstream systems.
Create DWS Layer Table
use dws;
drop table if exists dws.dws_member_add_day;
create table dws.dws_member_add_day
(
`device_id` string,
`uid` string,
`app_v` string,
`os_type` string,
`language` string,
`channel` string,
`area` string,
`brand` string,
`dt` string
) COMMENT 'Daily new member detail'
stored as parquet;
Load DWS Layer Data
Script /opt/wzk/hive/dws_load_member_add_day.sh:
#!/bin/bash
source /etc/profile
if [ -n "$1" ]
then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
sql="
insert into table dws.dws_member_add_day
select t1.device_id,
t1.uid,
t1.app_v,
t1.os_type,
t1.language,
t1.channel,
t1.area,
t1.brand,
'$do_date'
from dws.dws_member_start_day t1 left join
dws.dws_member_add_day t2
on t1.device_id=t2.device_id
where t1.dt='$do_date'
and t2.device_id is null;
"
hive -e "$sql"
ADS Layer Role
Aggregate and Simplify Data
Aggregate data from multiple tables and subject areas in the DWS layer into simple, easy-to-use tables or views. Directly output data results that meet business needs.
Business-Oriented Applications
Design wide tables or high-performance views to directly support specific business scenarios and reporting needs. Respond to fast query needs such as real-time data display.
Data Distribution and Integration
Provide efficient query interfaces for front-end BI tools, reporting systems, or API services. Accelerate query performance through caching mechanisms or materialized views.
Lightweight and High Performance
Minimize data volume, retaining only key metrics that the business cares about. Use pre-aggregation and pre-computation techniques to improve query efficiency.
Create ADS Layer Table
drop table if exists ads.ads_new_member_cnt;
create table ads.ads_new_member_cnt(
`cnt` string
)
partitioned by(dt string)
row format delimited fields terminated by ',';
Load ADS Layer Data
Script /opt/wzk/hive/ads_load_member_add.sh:
#!/bin/bash
source /etc/profile
if [ -n "$1" ] ;then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
sql="
insert overwrite table ads.ads_new_member_cnt
partition (dt='$do_date')
select count(1)
from dws.dws_member_add_day
where dt = '$do_date'
"
hive -e "$sql"
Brief Summary
Script execution order:
dws_load_member_add_day.shads_load_member_add.sh
Error Quick Reference
| Symptom | Root Cause Analysis | Fix |
|---|---|---|
| Daily new member count fluctuates wildly | New member anchor is unstable (device_id changes/uid changes) | Compare same-day uid vs device_id deduplication cardinality; clarify new member definition: prefer stable primary key; if needed, create device_id↔uid binding table |
| New member details appear repeatedly | Same user’s full member table incremental write failed or rolled back | Check if historical dt for this id exists in t2/full member table; ensure “insert into full member table” succeeds before downstream output; add idempotent constraints to inserts (partition/dedup) |
| New member details are 0, but startup details have data | Join condition/field type inconsistency causes full match failure | Sample check t1.id vs t2.id types, whether there are spaces/format differences; unify field types and cleaning rules; trim/cast before join if needed |
| ADS partition overwrite error or data goes to wrong date | do_date parameter/default date value doesn’t match scheduling time | Print script do_date and SQL where dt; unify scheduling time baseline; force scheduling parameter, avoid default -1 day error |
| ADS result accumulates instead of overwrites | Wrong insert into used, or partition not specified causing append | Check if using insert overwrite + partition(dt=); metric tables use insert overwrite to overwrite daily partition; detail tables decide append based on requirements |
| New member definition doesn’t match “uninstall reinstall doesn’t count as new” | Only using device_id cannot identify “same device uninstall reinstall” boundary | Count first/multiple occurrence patterns for same device_id; maintain first dt with device_id as primary key; if device reset causes changes, need risk control/埋点 enhancement |
| DWS/ADS serial execution still produces empty result | DWS table dt field write doesn’t match where dt | Verify DWS insert dt value is ‘$do_date’; ensure DWS write dt and query dt come from same source; avoid mixing t1.dt with script dt |