Overview

This article introduces using Hive to build an offline data warehouse for calculating active members (daily/weekly/monthly active users), covering the complete flow from DWD detail layer to DWS summary layer to ADS application layer.

Core Concepts

Active Members: Members who opened the application are considered active members.

Active Member Metrics Requirements: Daily, weekly, and monthly active member counts.

Data Flow:

  • DWD: Daily member startup detail information
  • DWS: Daily/weekly/monthly active member information
  • ADS: Daily, weekly, monthly active member counts (output)

DWS Layer Table Creation Script

use dws;
drop table if exists dws.dws_member_start_day;
create table dws.dws_member_start_day
(
  `device_id` string,
  `uid` string,
  `app_v` string,
  `os_type` string,
  `language` string,
  `channel` string,
  `area` string,
  `brand` string
) COMMENT 'Member Daily Startup Summary'
partitioned by(dt string)
stored as parquet;

drop table if exists dws.dws_member_start_week;
create table dws.dws_member_start_week(
  `device_id` string,
  `uid` string,
  `app_v` string,
  `os_type` string,
  `language` string,
  `channel` string,
  `area` string,
  `brand` string,
  `week` string
) COMMENT 'Member Weekly Startup Summary'
PARTITIONED BY (`dt` string)
stored as parquet;

drop table if exists dws.dws_member_start_month;
create table dws.dws_member_start_month(
  `device_id` string,
  `uid` string,
  `app_v` string,
  `os_type` string,
  `language` string,
  `channel` string,
  `area` string,
  `brand` string,
  `month` string
) COMMENT 'Member Monthly Startup Summary'
PARTITIONED BY (`dt` string)
stored as parquet;

DWS Layer Data Loading Script

#!/bin/bash
source /etc/profile
if [ -n "$1" ]
then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi

sql="
insert overwrite table dws.dws_member_start_day
partition(dt='$do_date')
select device_id,
concat_ws('|', collect_set(uid)),
concat_ws('|', collect_set(app_v)),
concat_ws('|', collect_set(os_type)),
concat_ws('|', collect_set(language)),
concat_ws('|', collect_set(channel)),
concat_ws('|', collect_set(area)),
concat_ws('|', collect_set(brand))
from dwd.dwd_start_log
where dt='$do_date'
group by device_id;

-- Summarize to get weekly active members
insert overwrite table dws.dws_member_start_week
partition(dt='$do_date')
select device_id,
concat_ws('|', collect_set(uid)),
concat_ws('|', collect_set(app_v)),
concat_ws('|', collect_set(os_type)),
concat_ws('|', collect_set(language)),
concat_ws('|', collect_set(channel)),
concat_ws('|', collect_set(area)),
concat_ws('|', collect_set(brand)),
date_add(next_day('$do_date', 'mo'), -7)
from dws.dws_member_start_day
where dt >= date_add(next_day('$do_date', 'mo'), -7)
and dt <= '$do_date'
group by device_id;

-- Summarize to get monthly active members
insert overwrite table dws.dws_member_start_month
partition(dt='$do_date')
select device_id,
concat_ws('|', collect_set(uid)),
concat_ws('|', collect_set(app_v)),
concat_ws('|', collect_set(os_type)),
concat_ws('|', collect_set(language)),
concat_ws('|', collect_set(channel)),
concat_ws('|', collect_set(area)),
concat_ws('|', collect_set(brand)),
date_format('$do_date', 'yyyy-MM')
from dws.dws_member_start_day
where dt >= date_format('$do_date', 'yyyy-MM-01')
and dt <= '$do_date'
group by device_id;
"
hive -e "$sql"

ADS Layer Table Creation Script

drop table if exists ads.ads_member_active_count;
create table ads.ads_member_active_count(
  `day_count` int COMMENT 'Member count for the day',
  `week_count` int COMMENT 'Member count for the week',
  `month_count` int COMMENT 'Member count for the month'
) COMMENT 'Active Member Count'
partitioned by(dt string)
row format delimited fields terminated by ',';

ADS Layer Data Loading Script (Method 1: WITH + UNION ALL)

#!/bin/bash
source /etc/profile
if [ -n "$1" ] ;then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi

sql="
with tmp as(
  select 'day' datelabel, count(*) cnt, dt
  from dws.dws_member_start_day
  where dt='$do_date'
  group by dt
  union all
  select 'week' datelabel, count(*) cnt, dt
  from dws.dws_member_start_week
  where dt='$do_date'
  group by dt
  union all
  select 'month' datelabel, count(*) cnt, dt
  from dws.dws_member_start_month
  where dt='$do_date'
  group by dt
)
insert overwrite table ads.ads_member_active_count
partition(dt='$do_date')
select sum(case when datelabel='day' then cnt end) as day_count,
sum(case when datelabel='week' then cnt end) as week_count,
sum(case when datelabel='month' then cnt end) as month_count
from tmp
group by dt;
"
hive -e "$sql"

ADS Layer Data Loading Script (Method 2: JOIN)

#!/bin/bash
source /etc/profile
if [ -n "$1" ] ;then
do_date=$1
else
do_date=`date -d "-1 day" +%F`
fi

sql="
insert overwrite table ads.ads_member_active_count
partition(dt='$do_date')
select daycnt, weekcnt, monthcnt
from (select dt, count(*) daycnt
      from dws.dws_member_start_day
      where dt='$do_date'
      group by dt
     ) day join
(select dt, count(*) weekcnt
 from dws.dws_member_start_week
 where dt='$do_date'
 group by dt
) week on day.dt=week.dt
join
(select dt, count(*) monthcnt
 from dws.dws_member_start_month
 where dt='$do_date'
 group by dt
) month on day.dt=month.dt;
"
hive -e "$sql"

Comparison of Two ADS Script Methods

  • Method 1 (WITH + UNION ALL): Aggregates data from different dimensions through a temporary table, then calculates final results using SUM. This method is flexible and easy to extend.
  • Method 2 (JOIN): Directly joins data from day, week, and month dimensions through JOIN. This may be more performant but less extensible.

Common Error Troubleshooting

SymptomRoot CauseFix
Shell script reports “bad interpreter”Shebang written as #!/bin/bash (full-width exclamation) or script has Windows line endingsChange to #!/bin/bash; run sed -i 's/\r$//' script.sh
uid/app_v in DWS daily table become long abc stringsUsing collect_set to concatenate strings is for “storage display fields”Avoid directly using collect_set to concatenate long strings
Weekly active count口径 unstableWeekly table partitioned by (dt=batch date), but week range uses “Monday to do_date” rolling windowWeekly summary should be partitioned/output by week
next_day reports error in some environmentsweekday parameter depends on implementation/language environmentUse explicit写法 like next_day(date_sub('$do_date',7),'MO')
ADS output is NULL or 0dt partition values inconsistent, or no data on batch dateEnsure ODS/DWD/DWS use the same do_date
Hive reports parse errorMulti-statement concatenation causes truncationEnd each statement with ; and add line breaks