Big Data 232 - Hive New Member & Retention

TL;DR

  • Scenario: the offline data warehouse calculates "new members" daily and provides a consistent data foundation for the subsequent "member retention" metric
  • Conclusion: use the "full member table (with first-day dt)" as the deduplication anchor; DWS outputs new member details, ADS outputs the new member count, and the scripts run serially
  • Output: reusable Hive SQL/script templates: new member identification (left join ... where ... is null) + incremental maintenance of the full member table + daily new member metric table

New Members

  • New members: users who use the app for the first time. A device that uninstalls and later reinstalls the app is NOT counted as new again
  • Retained members: members who were new in a given period and are still using the app after some interval

New members must be calculated first => only then can member retention be calculated

Requirements Description

Daily new member count

  • 08-02: DWD member daily startup details contain IDs 95-110; the full member table contains IDs 1-100

    • New members: 101-110
    • New members + existing full member table = updated full member table (1-110)
  • 08-03: DWD member daily startup details contain IDs 100-120; the full member table contains IDs 1-110

    • New members: 111-120
    • New members + existing full member table = updated full member table (1-120)

Calculation steps:

  1. Calculate new members
  2. Update all member information

Improved Method

  • Add a date column (dt) to the full member table, recording the day each member first appeared, i.e. the day they became a new member
  • Only one table is needed: the full member table (id, dt)
  • Each day, insert that day's new members into the full member table

Case: How to Calculate New Members

-- t1.dat data is as follows
4,2020-08-02
5,2020-08-02
6,2020-08-02
7,2020-08-02
8,2020-08-02
9,2020-08-02

-- Daily startup table => DWS
drop table t1;
create table t1(id int, dt string) row format delimited fields terminated by ',';
load data local inpath '/opt/wzk/hive/data/t1.dat' into table t1;
-- t2.dat data is as follows
1,2020-08-01
2,2020-08-01
3,2020-08-01
4,2020-08-01
5,2020-08-01
6,2020-08-01

-- Full data => DWS
drop table t2;
create table t2(id int, dt string)
row format delimited fields terminated by ',';
load data local inpath '/opt/wzk/hive/data/t2.dat' into table t2;
-- Find new members for 2020-08-02
select t1.id, t1.dt, t2.id, t2.dt
from t1 left join t2 on t1.id = t2.id
where t1.dt = '2020-08-02';

-- Keep only the rows with no match in the full member table
select t1.id, t1.dt
from t1 left join t2 on t1.id = t2.id
where t1.dt = '2020-08-02'
  and t2.id is null;
-- Returns ids 7, 8, 9: ids 4-6 are already in t2
-- Insert the new members found above into t2
insert into table t2
select t1.id, t1.dt
from t1 left join t2 on t1.id = t2.id
where t1.dt = '2020-08-02'
  and t2.id is null;

t2 data result:

7       2020-08-02
8       2020-08-02
9       2020-08-02
1       2020-08-01
2       2020-08-01
3       2020-08-01
4       2020-08-01
5       2020-08-01
6       2020-08-01
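At this point t2 should contain each id exactly once; duplicates would mean a failed or repeated incremental load. A quick sanity check (a sketch, using the t2 table created above):

```sql
-- Any id appearing more than once indicates a duplicated incremental insert
select id, count(*) as cnt
from t2
group by id
having count(*) > 1;
-- Expected: empty result set
```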

Continue loading new data:

-- t3.dat: the 2020-08-03 startup data, to be loaded into t1
14,2020-08-03
15,2020-08-03
16,2020-08-03
17,2020-08-03
18,2020-08-03
19,2020-08-03

load data local inpath '/opt/wzk/hive/data/t3.dat' into table t1;

-- Insert the new members for 2020-08-03 into t2
insert into table t2
select t1.id, t1.dt
from t1 left join t2 on t1.id = t2.id
where t1.dt = '2020-08-03'
  and t2.id is null;

-- Check result
select * from t2;

Final t2 data:

7       2020-08-02
8       2020-08-02
9       2020-08-02
14      2020-08-03
15      2020-08-03
16      2020-08-03
17      2020-08-03
18      2020-08-03
19      2020-08-03
1       2020-08-01
2       2020-08-01
3       2020-08-01
4       2020-08-01
5       2020-08-01
6       2020-08-01
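Because t2 stores each member's first-day dt, the daily new member count (the metric the ADS layer publishes later) falls out of a simple group-by (a sketch over the t2 table above):

```sql
-- Daily new member count, derived from the full member table
select dt, count(*) as new_members
from t2
group by dt
order by dt;
-- With the data above: 2020-08-01 -> 6, 2020-08-02 -> 3, 2020-08-03 -> 6
```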

DWS Layer Role

Unified Data Model

Integrate, clean, and process raw data (ODS layer) according to certain logical models to form standardized data structures. Support multi-dimensional and multi-granularity analysis of data.

Support Business Scenarios

Meet enterprise needs for querying and analyzing historical data. Support OLAP (Online Analytical Processing) operations such as aggregate queries, drilling, and slicing.

Data Refinement and Classification

Classify data by subject areas (such as sales, finance, inventory, etc.) for easier management and querying. Typically maintain high detail granularity for flexible expansion.

Data Accuracy and Consistency

Processed data is validated to ensure logical relationships are correct, providing accurate and consistent data for downstream systems.

Create DWS Layer Table

use dws;
drop table if exists dws.dws_member_add_day;
create table dws.dws_member_add_day
(
  `device_id` string,
  `uid` string,
  `app_v` string,
  `os_type` string,
  `language` string,
  `channel` string,
  `area` string,
  `brand` string,
  `dt` string
) COMMENT 'Daily new member detail'
stored as parquet;
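With the new-member detail table in place, the "member retention" mentioned in the TL;DR can be sketched on top of it. This is a hedged illustration, assuming dws.dws_member_start_day carries device_id and dt as used in the load script below:

```sql
-- Sketch: 1-day retention — members who were new on 2020-08-02
-- and started the app again on 2020-08-03
select t1.device_id
from dws.dws_member_add_day t1
join dws.dws_member_start_day t2
  on t1.device_id = t2.device_id
where t1.dt = '2020-08-02'
  and t2.dt = '2020-08-03';
```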

Load DWS Layer Data

Script /opt/wzk/hive/dws_load_member_add_day.sh:

#!/bin/bash
source /etc/profile

if [ -n "$1" ]; then
  do_date=$1
else
  do_date=$(date -d "-1 day" +%F)
fi

sql="
insert into table dws.dws_member_add_day
select t1.device_id,
       t1.uid,
       t1.app_v,
       t1.os_type,
       t1.language,
       t1.channel,
       t1.area,
       t1.brand,
       '$do_date'
from dws.dws_member_start_day t1 left join
     dws.dws_member_add_day t2
  on t1.device_id = t2.device_id
where t1.dt = '$do_date'
  and t2.device_id is null;
"
hive -e "$sql"
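The date handling in the script boils down to "use the argument if given, otherwise yesterday" (the usual T+1 batch date). It can be written as a single parameter expansion, shown here as a sketch; note that `date -d` is GNU date, so on macOS/BSD the equivalent is `date -v -1d`:

```shell
# Use $1 if provided; otherwise default to yesterday in YYYY-MM-DD form
do_date=${1:-$(date -d "-1 day" +%F)}
echo "$do_date"
```

Invoking the load script as `sh dws_load_member_add_day.sh 2020-08-02` therefore backfills a specific day, while a bare invocation processes yesterday's partition.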

ADS Layer Role

Aggregate and Simplify Data

Aggregate data from multiple tables and subject areas in the DWS layer into simple, easy-to-use tables or views. Directly output data results that meet business needs.

Business-Oriented Applications

Design wide tables or high-performance views to directly support specific business scenarios and reporting needs. Respond to fast query needs such as real-time data display.

Data Distribution and Integration

Provide efficient query interfaces for front-end BI tools, reporting systems, or API services. Accelerate query performance through caching mechanisms or materialized views.

Lightweight and High Performance

Minimize data volume, retaining only key metrics that the business cares about. Use pre-aggregation and pre-computation techniques to improve query efficiency.

Create ADS Layer Table

drop table if exists ads.ads_new_member_cnt;
create table ads.ads_new_member_cnt(
  `cnt` string
)
partitioned by(dt string)
row format delimited fields terminated by ',';
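Once loaded, downstream consumers (BI tools, reports) read the metric with a plain partition-pruned query, for example:

```sql
-- Read back the daily metric for one date
select dt, cnt
from ads.ads_new_member_cnt
where dt = '2020-08-02';
```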

Load ADS Layer Data

Script /opt/wzk/hive/ads_load_member_add.sh:

#!/bin/bash
source /etc/profile

if [ -n "$1" ]; then
  do_date=$1
else
  do_date=$(date -d "-1 day" +%F)
fi

sql="
insert overwrite table ads.ads_new_member_cnt
partition (dt='$do_date')
select count(1)
from dws.dws_member_add_day
where dt = '$do_date'
"
hive -e "$sql"

Brief Summary

Script execution order:

  1. dws_load_member_add_day.sh
  2. ads_load_member_add.sh
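A minimal driver can enforce this order and stop before the ADS load if the DWS load fails. This is a hedged sketch (the `run_pipeline` wrapper name is mine; the two script paths are the ones used above):

```shell
#!/bin/bash
# Run the DWS load, then the ADS load, for one business date.
# set -e aborts the pipeline if the DWS step exits non-zero,
# so the ADS count is never computed from a half-loaded detail table.
set -e

run_pipeline() {
  local do_date=${1:-$(date -d "-1 day" +%F)}
  sh /opt/wzk/hive/dws_load_member_add_day.sh "$do_date"
  sh /opt/wzk/hive/ads_load_member_add.sh "$do_date"
}
```

Called as `run_pipeline 2020-08-02`, it backfills one day; with no argument it processes yesterday.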

Error Quick Reference

| Symptom | Root cause | Fix |
| --- | --- | --- |
| Daily new member count fluctuates wildly | The new-member anchor is unstable (device_id or uid changes) | Compare the same-day dedup cardinality of uid vs device_id; pin the new-member definition to a stable primary key; if needed, maintain a device_id↔uid binding table |
| New member details appear repeatedly | The incremental write to the full member table failed or was rolled back | Check whether a historical dt already exists for the id in t2/the full member table; ensure the insert into the full member table succeeds before any downstream output; make the insert idempotent (partition overwrite/dedup) |
| New member details are empty although startup details have data | Join key types or formats differ, so no rows match | Sample-check the types of t1.id vs t2.id and look for spaces or format differences; unify field types and cleaning rules; trim/cast before the join if needed |
| ADS partition overwrite fails, or data lands under the wrong date | The do_date parameter or its default does not match the scheduler's time baseline | Print the script's do_date and the SQL where dt; unify the scheduling time baseline; pass the date explicitly instead of relying on the "-1 day" default |
| ADS result accumulates instead of being overwritten | insert into was used, or no partition was specified, so rows append | Confirm insert overwrite + partition(dt=...) is used; metric tables should overwrite the daily partition; for detail tables, decide append vs overwrite per requirements |
| New-member definition violates "uninstall + reinstall does not count as new" | device_id alone cannot detect the same device after an uninstall/reinstall | Analyse first/repeat occurrence patterns per device_id; keep the first dt keyed by device_id; if device resets change the id, strengthen risk control/event tracking |
| DWS and ADS run serially but the result is still empty | The dt written by DWS does not match the dt in the where clause | Verify the DWS insert writes dt = '$do_date'; make sure the written dt and the queried dt come from the same source; do not mix t1.dt with the script's dt |