Overview
This article introduces using Hive to build an offline data warehouse for calculating active members (daily/weekly/monthly active users), covering the complete flow from DWD detail layer to DWS summary layer to ADS application layer.
Core Concepts
Active Members: Members who opened the application are considered active members.
Active Member Metrics Requirements: Daily, weekly, and monthly active member counts.
Data Flow:
- DWD: Daily member startup detail information
- DWS: Daily/weekly/monthly active member information
- ADS: Daily, weekly, monthly active member counts (output)
DWS Layer Table Creation Script
use dws;
drop table if exists dws.dws_member_start_day;
create table dws.dws_member_start_day
(
`device_id` string,
`uid` string,
`app_v` string,
`os_type` string,
`language` string,
`channel` string,
`area` string,
`brand` string
) COMMENT 'Member Daily Startup Summary'
partitioned by(dt string)
stored as parquet;
drop table if exists dws.dws_member_start_week;
create table dws.dws_member_start_week(
`device_id` string,
`uid` string,
`app_v` string,
`os_type` string,
`language` string,
`channel` string,
`area` string,
`brand` string,
`week` string
) COMMENT 'Member Weekly Startup Summary'
PARTITIONED BY (`dt` string)
stored as parquet;
drop table if exists dws.dws_member_start_month;
create table dws.dws_member_start_month(
`device_id` string,
`uid` string,
`app_v` string,
`os_type` string,
`language` string,
`channel` string,
`area` string,
`brand` string,
`month` string
) COMMENT 'Member Monthly Startup Summary'
PARTITIONED BY (`dt` string)
stored as parquet;
DWS Layer Data Loading Script
#!/bin/bash
source /etc/profile
if [ -n "$1" ]
then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
sql="
insert overwrite table dws.dws_member_start_day
partition(dt='$do_date')
select device_id,
concat_ws('|', collect_set(uid)),
concat_ws('|', collect_set(app_v)),
concat_ws('|', collect_set(os_type)),
concat_ws('|', collect_set(language)),
concat_ws('|', collect_set(channel)),
concat_ws('|', collect_set(area)),
concat_ws('|', collect_set(brand))
from dwd.dwd_start_log
where dt='$do_date'
group by device_id;
-- Summarize to get weekly active members
insert overwrite table dws.dws_member_start_week
partition(dt='$do_date')
select device_id,
concat_ws('|', collect_set(uid)),
concat_ws('|', collect_set(app_v)),
concat_ws('|', collect_set(os_type)),
concat_ws('|', collect_set(language)),
concat_ws('|', collect_set(channel)),
concat_ws('|', collect_set(area)),
concat_ws('|', collect_set(brand)),
date_add(next_day('$do_date', 'mo'), -7)
from dws.dws_member_start_day
where dt >= date_add(next_day('$do_date', 'mo'), -7)
and dt <= '$do_date'
group by device_id;
-- Summarize to get monthly active members
insert overwrite table dws.dws_member_start_month
partition(dt='$do_date')
select device_id,
concat_ws('|', collect_set(uid)),
concat_ws('|', collect_set(app_v)),
concat_ws('|', collect_set(os_type)),
concat_ws('|', collect_set(language)),
concat_ws('|', collect_set(channel)),
concat_ws('|', collect_set(area)),
concat_ws('|', collect_set(brand)),
date_format('$do_date', 'yyyy-MM')
from dws.dws_member_start_day
where dt >= date_format('$do_date', 'yyyy-MM-01')
and dt <= '$do_date'
group by device_id;
"
hive -e "$sql"
ADS Layer Table Creation Script
drop table if exists ads.ads_member_active_count;
create table ads.ads_member_active_count(
`day_count` int COMMENT 'Member count for the day',
`week_count` int COMMENT 'Member count for the week',
`month_count` int COMMENT 'Member count for the month'
) COMMENT 'Active Member Count'
partitioned by(dt string)
row format delimited fields terminated by ',';
ADS Layer Data Loading Script (Method 1: WITH + UNION ALL)
#!/bin/bash
source /etc/profile
if [ -n "$1" ] ;then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
sql="
with tmp as(
select 'day' datelabel, count(*) cnt, dt
from dws.dws_member_start_day
where dt='$do_date'
group by dt
union all
select 'week' datelabel, count(*) cnt, dt
from dws.dws_member_start_week
where dt='$do_date'
group by dt
union all
select 'month' datelabel, count(*) cnt, dt
from dws.dws_member_start_month
where dt='$do_date'
group by dt
)
insert overwrite table ads.ads_member_active_count
partition(dt='$do_date')
select sum(case when datelabel='day' then cnt end) as day_count,
sum(case when datelabel='week' then cnt end) as week_count,
sum(case when datelabel='month' then cnt end) as month_count
from tmp
group by dt;
"
hive -e "$sql"
ADS Layer Data Loading Script (Method 2: JOIN)
#!/bin/bash
source /etc/profile
if [ -n "$1" ] ;then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi
sql="
insert overwrite table ads.ads_member_active_count
partition(dt='$do_date')
select daycnt, weekcnt, monthcnt
from (select dt, count(*) daycnt
from dws.dws_member_start_day
where dt='$do_date'
group by dt
) day join
(select dt, count(*) weekcnt
from dws.dws_member_start_week
where dt='$do_date'
group by dt
) week on day.dt=week.dt
join
(select dt, count(*) monthcnt
from dws.dws_member_start_month
where dt='$do_date'
group by dt
) month on day.dt=month.dt;
"
hive -e "$sql"
Comparison of Two ADS Script Methods
- Method 1 (WITH + UNION ALL): Aggregates data from different dimensions through a temporary table, then calculates final results using SUM. This method is flexible and easy to extend.
- Method 2 (JOIN): Directly joins data from day, week, and month dimensions through JOIN. This may be more performant but less extensible.
Common Error Troubleshooting
| Symptom | Root Cause | Fix |
|---|---|---|
| Shell script reports “bad interpreter” | Shebang written as #!/bin/bash (full-width exclamation) or script has Windows line endings | Change to #!/bin/bash; run sed -i 's/\r$//' script.sh |
| uid/app_v in DWS daily table become long abc strings | Using collect_set to concatenate strings is for “storage display fields” | Avoid directly using collect_set to concatenate long strings |
| Weekly active count口径 unstable | Weekly table partitioned by (dt=batch date), but week range uses “Monday to do_date” rolling window | Weekly summary should be partitioned/output by week |
| next_day reports error in some environments | weekday parameter depends on implementation/language environment | Use explicit写法 like next_day(date_sub('$do_date',7),'MO') |
| ADS output is NULL or 0 | dt partition values inconsistent, or no data on batch date | Ensure ODS/DWD/DWS use the same do_date |
| Hive reports parse error | Multi-statement concatenation causes truncation | End each statement with ; and add line breaks |