TL;DR
- Scenario: use startup/event logs in an offline data warehouse to count new members, active members (DAU/WAU/MAU), and retention.
- Conclusion: on the collection side, use Flume 1.8+ Taildir to tail the logs and write them to HDFS partitions; the data warehouse layers carry the member-identification criteria and the aggregations.
- Output: a reusable Taildir→HDFS Sink configuration template, metric-definition alignment points, and a common-failure troubleshooting list.
Requirement Analysis
Member data is very important for later marketing: online stores run targeted marketing campaigns for their members. E-commerce membership generally has a low threshold; users can join simply by registering on the website. Some platforms also offer time-limited senior memberships: users must purchase a VIP membership card or spend a certain amount within a year to qualify.
Calculation Metrics
- New Members: number of new members in each statistics period
- Active Members: Daily, weekly, monthly active member count
- Member Retention: 1, 2, 3 day member retention count, 1, 2, 3 day member retention rate
Metric Definition Business Logic
Member Identification Criteria
Member Definition: the mobile device serves as the unique identifier; each distinct device is counted as one member. Identification methods:
- Android: identified by the device's IMEI (International Mobile Equipment Identity); each Android device has a unique 15-digit IMEI code
- iOS: identified by OpenUDID (Open Unique Device Identifier), an open device identifier widely used on iOS
- Special cases: multiple devices belonging to the same user are counted as separate members; a device that is flashed or factory-reset is still counted as the same member
Active Member Definition
- Daily Active Members (DAU): distinct devices that opened the app at least once during a calendar day (00:00-23:59); no matter how many times the app is opened, each device counts as 1 active member
- Weekly Active Members (WAU): distinct devices that started the app at least once during a calendar week (Monday to Sunday)
- Monthly Active Members (MAU): distinct devices that started the app at least once during a calendar month (1st to end of month)
Member Activity Rate Calculation
- Daily Activity Rate (DAU Rate) = Daily Active Members ÷ Total Registered Members × 100%
- Weekly Activity Rate (WAU Rate) = Weekly Active Members ÷ Total Registered Members × 100%
- Monthly Activity Rate (MAU Rate) = Monthly Active Members ÷ Total Registered Members × 100%; an MAU rate of 10-20% is generally considered healthy
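A minimal sketch of the DAU and activity-rate calculation, assuming a hypothetical list of (device_id, date) launch events and an assumed registered-member total:

```python
from datetime import date

# Hypothetical launch events: (device_id, launch_date); sample data only
events = [
    ("1FB872-9A1009", date(2020, 7, 30)),
    ("1FB872-9A1009", date(2020, 7, 30)),  # second launch that day: still one DAU
    ("2C0011-8B4420", date(2020, 7, 30)),
]
total_registered = 20  # assumed total registered members

# DAU: count of distinct devices that launched on the target calendar day
dau = len({dev for dev, d in events if d == date(2020, 7, 30)})
dau_rate = dau / total_registered * 100  # Daily Activity Rate in percent

print(dau, f"{dau_rate:.1f}%")  # 2 10.0%
```

The set comprehension is the deduplication: repeated launches by one device collapse into a single member.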
New Member Statistics
New Member Definition: a member who installs and starts the app on a device for the first time
- Deduplication: reinstalling after an uninstall is not counted as a new member; switching devices while logging in with the same account is not counted as a new member
- Statistics dimensions: daily new, weekly new, monthly new
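Under the device-as-member criterion, "new" reduces to "the first date this device_id is ever seen". A sketch with hypothetical device IDs and dates:

```python
from datetime import date

# Hypothetical launch events: (device_id, launch_date); sample data only
events = [
    ("devA", date(2020, 7, 29)),
    ("devA", date(2020, 7, 30)),  # relaunch/reinstall: devA is not new on 7-30
    ("devB", date(2020, 7, 30)),
]

# first_seen keeps the earliest launch date per device
first_seen = {}
for dev, d in sorted(events, key=lambda e: e[1]):
    first_seen.setdefault(dev, d)

# Daily new members on 2020-07-30: devices whose first launch was that day
new_members = [dev for dev, d in first_seen.items() if d == date(2020, 7, 30)]
print(new_members)  # ['devB']
```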
Retention Analysis
- Retained Members: members newly added in an initial period (e.g., a given day/week/month) who are still active in a subsequent period
- Retention Rate = Retained Members ÷ Initial New Members × 100%
- Common metrics: Next-day retention, 7-day retention, 30-day retention
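The retention-rate formula, sketched as a set intersection over hypothetical device-id sets:

```python
# Hypothetical device-id sets; sample data only
new_on_day0 = {"devA", "devB", "devC", "devD"}   # new members on day 0
active_on_day1 = {"devA", "devC", "devX"}        # active members on day 1

retained = new_on_day0 & active_on_day1          # retained = new AND still active
retention_rate = len(retained) / len(new_on_day0) * 100

print(sorted(retained), f"{retention_rate:.1f}%")  # ['devA', 'devC'] 50.0%
```

The same intersection, shifted to day 7 or day 30 activity sets, gives 7-day and 30-day retention.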
Log Data Collection
Raw Log Data (One Startup Log)
2020-07-30 14:18:47.339 [main] INFO com.ecommerce.AppStart - {"app_active":{"name":"app_active","json":{"entry":"1","action":"1","error_code":"0"},"time":1596111888529},"attr":{"area":"泰安","uid":"2F10092A9","app_v":"1.1.13","event_type":"common","device_id":"1FB872-9A1009","os_type":"4.7.3","channel":"DK","language":"chinese","brand":"iphone-9"}}
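Each line is a log4j-style prefix followed by a JSON body, so slicing from the first `{` recovers the event. A sketch that parses the sample line above:

```python
import json

line = ('2020-07-30 14:18:47.339 [main] INFO com.ecommerce.AppStart - '
        '{"app_active":{"name":"app_active","json":{"entry":"1","action":"1",'
        '"error_code":"0"},"time":1596111888529},"attr":{"area":"泰安",'
        '"uid":"2F10092A9","app_v":"1.1.13","event_type":"common",'
        '"device_id":"1FB872-9A1009","os_type":"4.7.3","channel":"DK",'
        '"language":"chinese","brand":"iphone-9"}}')

# The JSON body starts at the first '{' after the log prefix
event = json.loads(line[line.index("{"):])
print(event["attr"]["device_id"])   # 1FB872-9A1009
print(event["app_active"]["time"])  # 1596111888529
```

`device_id` is the member identifier used by all the metrics above; `time` is the event timestamp in milliseconds.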
Taildir Source Characteristics
- Uses regular expressions to match file names in a directory
- Once data is appended to a monitored file, Flume delivers it to the configured Sink
- Highly reliable; does not lose data
- Does not modify the tracked files in any way: it neither renames nor deletes them
- Does not support Windows and cannot read binary files; it reads text files line by line
Taildir Source Configuration
- positionFile: path of the checkpoint file, which records the read position of each tracked file in JSON format, enabling resume after a restart
- filegroups: declares the file groups; there can be several, separated by spaces (a Taildir source can monitor files in multiple directories simultaneously)
- filegroups.f1: absolute file path for each file group; the file name part may be a regular expression
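For troubleshooting, the positionFile can be read directly: it is a JSON array with one entry per tracked file. The inode/pos/file values below are assumed sample data:

```python
import json

# Example positionFile content; inode/pos/file values are assumed sample data
position_json = ('[{"inode":2496001,"pos":1024,'
                 '"file":"/opt/wzk/logs/start/app.2020-07-30.log"}]')

for entry in json.loads(position_json):
    # pos is the byte offset already consumed; Flume resumes from it on restart
    print(entry["file"], entry["pos"])
```

If `pos` stops growing while the log file keeps growing, collection has stalled.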
HDFS Sink Configuration
HDFS Sink writes files in a rolling fashion; the rolling strategies (with defaults):
- By time: hdfs.rollInterval, default 30 seconds
- By file size: hdfs.rollSize, default 1024 bytes
- By event count: hdfs.rollCount, default 10 events
- By file idle time: hdfs.idleTimeout, default 0
- A value of 0 disables the corresponding strategy
- hdfs.minBlockReplicas: defaults to the HDFS replica count. Set it to 1 so that Flume does not observe HDFS block replication events, which would otherwise trigger premature file rolls
Agent Configuration
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# taildir source
a1.sources.r1.type = TAILDIR
a1.sources.r1.positionFile = /opt/wzk/conf/startlog_position.json
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /opt/wzk/logs/start/.*log
# memorychannel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 100000
a1.channels.c1.transactionCapacity = 2000
# hdfs sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /user/data/logs/start/%Y-%m-%d/
a1.sinks.k1.hdfs.filePrefix = startlog.
a1.sinks.k1.hdfs.fileType = DataStream
# Configure file rolling method (file size 32M)
a1.sinks.k1.hdfs.rollSize = 33554432
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.idleTimeout = 0
a1.sinks.k1.hdfs.minBlockReplicas = 1
# Number of events flushed to hdfs
a1.sinks.k1.hdfs.batchSize = 1000
# Use local time
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Starting the Agent
flume-ng agent --conf-file /opt/wzk/flume-conf/flume-log2hdfs1.conf --name a1 -Dflume.root.logger=INFO,console
Error Quick Reference
| Symptom | Root Cause Analysis | Fix |
|---|---|---|
| Taildir collects nothing; no new files in HDFS | filegroups regex matches no files, or the path is wrong | Search the Flume log for TaildirSource to see the monitored file list; verify the path and regex |
| positionFile not updating / collection is intermittent | positionFile directory missing or not writable | Check that the positionFile is created and its JSON grows; check the permissions of the user running Flume |
| HDFS writes fail and retry repeatedly | HDFS permissions / NameNode connectivity / target path missing | Search the Flume log for hdfs / Permission denied / Failed; check target directory permissions on the HDFS side |
| Files do not roll for a long time; no finished files appear in HDFS | rollSize/rollInterval/rollCount all disabled, or thresholds never reached | For low-traffic testing, set rollInterval (e.g., 60 or 300 seconds) or lower rollSize |
| Memory spikes / Channel OOM / throughput jitter | memory channel capacity too large, or slow HDFS writes causing a backlog | Reduce capacity; increase Sink parallelism or tune HDFS; switch to a file channel if necessary |
| Duplicate or missing data (after an abnormal restart) | positionFile was cleared or rolled back; log rotation scheme conflicts with Taildir tracking | Keep the positionFile stable and persistent; avoid deleting or modifying it; rotating logs in append mode is more stable |
| Wrong partition date (UTC/timezone offset) | Event timestamp source inconsistent with useLocalTimeStamp | Enable hdfs.useLocalTimeStamp = true |