TL;DR
- Scenario: use startup/event logs in an offline data warehouse to count new members, active members (DAU/WAU/MAU), and retention.
- Conclusion: on the collection side, use Flume 1.8+ Taildir to tail the logs and write them to HDFS partitions; the data warehouse layers carry the member-identification criteria and the aggregations.
- Output: a reusable Taildir→HDFS Sink configuration template, metric-definition alignment points, and a common-failure troubleshooting list.
Requirement Analysis
Member data is very important for later marketing: online stores run targeted marketing campaigns for their members. E-commerce membership generally has a low threshold; users can join simply by registering on the website. Some platforms also offer time-limited senior memberships: users must purchase a VIP membership card or spend a certain amount within a year to qualify.
Calculation Metrics
- New Members: number of new members in each statistics period
- Active Members: Daily, weekly, monthly active member count
- Member Retention: 1, 2, 3 day member retention count, 1, 2, 3 day member retention rate
Metric Definition Business Logic
Member Identification Criteria
Member Definition: the mobile device serves as the unique identifier; each distinct device is counted as one member. Identification methods:
- Android: identified by the device's IMEI (International Mobile Equipment Identity); each Android device has a unique 15-digit IMEI code
- iOS: identified by OpenUDID (Open Unique Device Identifier), an open device identifier widely used on iOS
- Special cases: multiple devices belonging to the same user are counted as separate members; a device that is flashed or factory-reset is still counted as the same member
Active Member Definition
- Daily Active Members (DAU): distinct devices that opened the app at least once during a calendar day (00:00-23:59); no matter how many times the app is opened, each device counts as 1 active member
- Weekly Active Members (WAU): distinct devices that started the app at least once during a calendar week (Monday to Sunday)
- Monthly Active Members (MAU): distinct devices that started the app at least once during a calendar month (1st to end of month)
Member Activity Rate Calculation
- Daily Activity Rate (DAU Rate) = Daily Active Members ÷ Total Registered Members × 100%
- Weekly Activity Rate (WAU Rate) = Weekly Active Members ÷ Total Registered Members × 100%
- Monthly Activity Rate (MAU Rate) = Monthly Active Members ÷ Total Registered Members × 100%; an MAU rate of 10-20% is generally considered healthy
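A minimal sketch of the DAU and activity-rate calculation, assuming a hypothetical list of (device_id, date) launch events and an assumed registered-member total:

```python
from datetime import date

# Hypothetical launch events: (device_id, launch_date); sample data only
events = [
    ("1FB872-9A1009", date(2020, 7, 30)),
    ("1FB872-9A1009", date(2020, 7, 30)),  # second launch that day: still one DAU
    ("2C0011-8B4420", date(2020, 7, 30)),
]
total_registered = 20  # assumed total registered members

# DAU: count of distinct devices that launched on the target calendar day
dau = len({dev for dev, d in events if d == date(2020, 7, 30)})
dau_rate = dau / total_registered * 100  # Daily Activity Rate in percent

print(dau, f"{dau_rate:.1f}%")  # 2 10.0%
```

The set comprehension is the deduplication: repeated launches by one device collapse into a single member.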
New Member Statistics
New Member Definition: a member who installs and starts the app on a device for the first time
- Deduplication: reinstalling after an uninstall is not counted as a new member; switching devices while logging in with the same account is not counted as a new member
- Statistics dimensions: daily new, weekly new, monthly new
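Under the device-as-member criterion, "new" reduces to "the first date this device_id is ever seen". A sketch with hypothetical device IDs and dates:

```python
from datetime import date

# Hypothetical launch events: (device_id, launch_date); sample data only
events = [
    ("devA", date(2020, 7, 29)),
    ("devA", date(2020, 7, 30)),  # relaunch/reinstall: devA is not new on 7-30
    ("devB", date(2020, 7, 30)),
]

# first_seen keeps the earliest launch date per device
first_seen = {}
for dev, d in sorted(events, key=lambda e: e[1]):
    first_seen.setdefault(dev, d)

# Daily new members on 2020-07-30: devices whose first launch was that day
new_members = [dev for dev, d in first_seen.items() if d == date(2020, 7, 30)]
print(new_members)  # ['devB']
```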
Retention Analysis
- Retained Members: members newly added in an initial period (e.g., a given day/week/month) who are still active in a subsequent period
- Retention Rate = Retained Members ÷ Initial New Members × 100%
- Common metrics: Next-day retention, 7-day retention, 30-day retention
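The retention-rate formula, sketched as a set intersection over hypothetical device-id sets:

```python
# Hypothetical device-id sets; sample data only
new_on_day0 = {"devA", "devB", "devC", "devD"}   # new members on day 0
active_on_day1 = {"devA", "devC", "devX"}        # active members on day 1

retained = new_on_day0 & active_on_day1          # retained = new AND still active
retention_rate = len(retained) / len(new_on_day0) * 100

print(sorted(retained), f"{retention_rate:.1f}%")  # ['devA', 'devC'] 50.0%
```

The same intersection, shifted to day 7 or day 30 activity sets, gives 7-day and 30-day retention.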
Log Data Collection
Raw Log Data (One Startup Log)
2020-07-30 14:18:47.339 [main] INFO com.ecommerce.AppStart - {"app_active":{"name":"app_active","json":{"entry":"1","action":"1","error_code":"0"},"time":1596111888529},"attr":{"area":"泰安","uid":"2F10092A9","app_v":"1.1.13","event_type":"common","device_id":"1FB872-9A1009","os_type":"4.7.3","channel":"DK","language":"chinese","brand":"iphone-9"}}
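Each line is a log4j-style prefix followed by a JSON body, so slicing from the first `{` recovers the event. A sketch that parses the sample line above:

```python
import json

line = ('2020-07-30 14:18:47.339 [main] INFO com.ecommerce.AppStart - '
        '{"app_active":{"name":"app_active","json":{"entry":"1","action":"1",'
        '"error_code":"0"},"time":1596111888529},"attr":{"area":"泰安",'
        '"uid":"2F10092A9","app_v":"1.1.13","event_type":"common",'
        '"device_id":"1FB872-9A1009","os_type":"4.7.3","channel":"DK",'
        '"language":"chinese","brand":"iphone-9"}}')

# The JSON body starts at the first '{' after the log prefix
event = json.loads(line[line.index("{"):])
print(event["attr"]["device_id"])   # 1FB872-9A1009
print(event["app_active"]["time"])  # 1596111888529
```

`device_id` is the member identifier used by all the metrics above; `time` is the event timestamp in milliseconds.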
Taildir Source Characteristics
- Uses regular expressions to match file names in a directory
- Once data is appended to a monitored file, Flume delivers it to the configured Sink
- Highly reliable; does not lose data
- Does not modify the tracked files in any way: it neither renames nor deletes them
- Does not support Windows and cannot read binary files; it reads text files line by line
Taildir Source Configuration
- positionFile: path of the checkpoint file, which records the read position of each tracked file in JSON format, enabling resume after a restart
- filegroups: declares the file groups; there can be several, separated by spaces (a Taildir source can monitor files in multiple directories simultaneously)
- filegroups.f1: absolute file path for each file group; the file name part may be a regular expression
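For troubleshooting, the positionFile can be read directly: it is a JSON array with one entry per tracked file. The inode/pos/file values below are assumed sample data:

```python
import json

# Example positionFile content; inode/pos/file values are assumed sample data
position_json = ('[{"inode":2496001,"pos":1024,'
                 '"file":"/opt/wzk/logs/start/app.2020-07-30.log"}]')

for entry in json.loads(position_json):
    # pos is the byte offset already consumed; Flume resumes from it on restart
    print(entry["file"], entry["pos"])
```

If `pos` stops growing while the log file keeps growing, collection has stalled.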
HDFS Sink Configuration
HDFS Sink writes files in a rolling fashion; the rolling strategies (with defaults):
- By time: hdfs.rollInterval, default 30 seconds
- By file size: hdfs.rollSize, default 1024 bytes
- By event count: hdfs.rollCount, default 10 events
- By file idle time: hdfs.idleTimeout, default 0
- A value of 0 disables the corresponding strategy
- hdfs.minBlockReplicas: defaults to the HDFS replica count. Set it to 1 so that Flume does not observe HDFS block replication events, which would otherwise trigger premature file rolls
Agent Configuration
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# taildir source
a1.sources.r1.type = TAILDIR
a1.sources.r1.positionFile = /opt/wzk/conf/startlog_position.json
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /opt/wzk/logs/start/.*log
# memorychannel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 100000
a1.channels.c1.transactionCapacity = 2000
# hdfs sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /user/data/logs/start/%Y-%m-%d/
a1.sinks.k1.hdfs.filePrefix = startlog.
a1.sinks.k1.hdfs.fileType = DataStream
# Configure file rolling method (file size 32M)
a1.sinks.k1.hdfs.rollSize = 33554432
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.idleTimeout = 0
a1.sinks.k1.hdfs.minBlockReplicas = 1
# Number of events flushed to hdfs
a1.sinks.k1.hdfs.batchSize = 1000
# Use local time
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Starting the Agent
flume-ng agent --conf-file /opt/wzk/flume-conf/flume-log2hdfs1.conf --name a1 -Dflume.root.logger=INFO,console
Error Quick Reference
| Symptom | Root Cause Analysis | Fix |
|---|---|---|
| Taildir collects nothing; no new files in HDFS | filegroups regex matches no files, or the path is wrong | Search the Flume log for TaildirSource to see the monitored file list; verify the path and regex |
| positionFile not updating / collection is intermittent | positionFile directory missing or not writable | Check that the positionFile is created and its JSON grows; check the permissions of the user running Flume |
| HDFS writes fail and retry repeatedly | HDFS permissions / NameNode connectivity / target path missing | Search the Flume log for hdfs / Permission denied / Failed; check target directory permissions on the HDFS side |
| Files do not roll for a long time; no finished files appear in HDFS | rollSize/rollInterval/rollCount all disabled, or thresholds never reached | For low-traffic testing, set rollInterval (e.g., 60 or 300 seconds) or lower rollSize |
| Memory spikes / Channel OOM / throughput jitter | memory channel capacity too large, or slow HDFS writes causing a backlog | Reduce capacity; increase Sink parallelism or tune HDFS; switch to a file channel if necessary |
| Duplicate or missing data (after an abnormal restart) | positionFile was cleared or rolled back; log rotation scheme conflicts with Taildir tracking | Keep the positionFile stable and persistent; avoid deleting or modifying it; rotating logs in append mode is more stable |
| Wrong partition date (UTC/timezone offset) | Event timestamp source inconsistent with useLocalTimeStamp | Enable hdfs.useLocalTimeStamp = true |