TL;DR

  • Scenario: Use app startup/event logs in an offline data warehouse to count new members, active members (DAU/WAU/MAU), and retention.
  • Conclusion: On the collection side, use the Flume 1.8+ Taildir source to monitor logs and write them to date-partitioned HDFS paths; the layered data warehouse applies the member-identification criteria and aggregations.
  • Output: A reusable Taildir → HDFS Sink configuration template, the metric-definition alignment points, and a quick-reference list of common failures.

Requirement Analysis

Member data is critical for later marketing: online stores run marketing campaigns aimed specifically at members. E-commerce membership generally has a low barrier to entry; registering on the website is enough to join. On some platforms, senior membership is time-limited: users must purchase a VIP membership card or spend a certain amount within a year to become senior members.

Calculation Metrics

  • New Members: Number of new members per statistical period
  • Active Members: Daily, weekly, and monthly active member counts
  • Member Retention: 1-, 2-, and 3-day retained member counts and retention rates

Metric Definition Business Logic

Member Identification Criteria

Member Definition: The mobile device is the unique identifier; each independent device counts as one member. Identification methods:

  • Android: Identified by the device IMEI (International Mobile Equipment Identity), a unique 15-digit code per Android device
  • iOS: Identified by OpenUDID (Open Unique Device Identifier), a widely used open device identifier on iOS
  • Special cases: Multiple devices belonging to the same user are counted as different members; a flashed ROM or factory reset is still treated as the same member

Active Member Definition

  • Daily Active Members (DAU): Independent devices that opened the app at least once during a calendar day (00:00-23:59); no matter how many times the app is opened, each device counts as 1 active member
  • Weekly Active Members (WAU): Independent devices that started the app at least once during a calendar week (Monday to Sunday)
  • Monthly Active Members (MAU): Independent devices that started the app at least once during a calendar month (the 1st to the end of the month)
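Under these definitions, DAU is just the count of distinct device ids seen in a day. A minimal Python sketch (the events below are made-up sample data, not the real log schema):

```python
# DAU = number of distinct device_ids that started the app on a given day.
# (day, device_id) pairs are illustrative sample data.
events = [
    ("2020-07-30", "1FB872-9A1009"),
    ("2020-07-30", "1FB872-9A1009"),  # same device opening the app again
    ("2020-07-30", "2C41AA-0B77F2"),
]
dau = len({device for day, device in events if day == "2020-07-30"})
print(dau)  # 2 — repeated startups of the same device count only once
```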

Member Activity Rate Calculation

  • Daily Activity Rate (DAU Rate): Daily Active Members ÷ Total Registered Members × 100%
  • Weekly Activity Rate (WAU Rate): Weekly Active Members ÷ Total Registered Members × 100%
  • Monthly Activity Rate (MAU Rate): Monthly Active Members ÷ Total Registered Members × 100%; an MAU rate of 10-20% is usually considered healthy
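The three rate formulas share one shape, so a single helper covers them. A sketch in Python; the member counts are illustrative, not real data:

```python
def activity_rate(active_members: int, total_registered: int) -> float:
    """Active members ÷ total registered members × 100%, rounded to 2 places."""
    return round(active_members / total_registered * 100, 2)

# Illustrative numbers: 12,000 DAU and 18,000 MAU out of 100,000 registered.
dau_rate = activity_rate(12_000, 100_000)
mau_rate = activity_rate(18_000, 100_000)
print(dau_rate, mau_rate)  # 12.0 18.0 — the MAU rate sits inside the healthy 10-20% band
```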

New Member Statistics

New Member Definition: A member whose device installed and started the app for the first time

  • Deduplication: Reinstalling after an uninstall does not count as a new member; switching devices while logging into the same account does not count as a new member either
  • Statistics dimensions: Daily, weekly, and monthly new members

Retention Analysis

  • Retained Members: Members who were new in an initial period (a given day/week/month) and are still active in a subsequent period
  • Retention Rate: Retained Members ÷ New Members in the initial period × 100%
  • Common metrics: Next-day retention, 7-day retention, 30-day retention
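With members represented as sets of device ids, retention reduces to a set intersection. A sketch with made-up ids:

```python
def retention(new_members: set, active_later: set) -> tuple:
    """Return (retained count, retention rate = retained ÷ new × 100%)."""
    retained = new_members & active_later  # new members still active later
    rate = len(retained) / len(new_members) * 100
    return len(retained), round(rate, 1)

new_day0 = {"dev-1", "dev-2", "dev-3", "dev-4"}  # new members on day 0
active_day1 = {"dev-2", "dev-4", "dev-9"}        # active devices on day 1
print(retention(new_day0, active_day1))  # (2, 50.0): next-day retention is 50%
```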

Log Data Collection

Raw Log Data (One Startup Log)

2020-07-30 14:18:47.339 [main] INFO com.ecommerce.AppStart - {"app_active":{"name":"app_active","json":{"entry":"1","action":"1","error_code":"0"},"time":1596111888529},"attr":{"area":"泰安","uid":"2F10092A9","app_v":"1.1.13","event_type":"common","device_id":"1FB872-9A1009","os_type":"4.7.3","channel":"DK","language":"chinese","brand":"iphone-9"}}
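Each startup log line is a log4j-style prefix followed by a JSON payload; the member key and event time can be pulled out as below. A sketch against the sample line above:

```python
import json

line = ('2020-07-30 14:18:47.339 [main] INFO com.ecommerce.AppStart - '
        '{"app_active":{"name":"app_active","json":{"entry":"1","action":"1",'
        '"error_code":"0"},"time":1596111888529},"attr":{"area":"泰安",'
        '"uid":"2F10092A9","app_v":"1.1.13","event_type":"common",'
        '"device_id":"1FB872-9A1009","os_type":"4.7.3","channel":"DK",'
        '"language":"chinese","brand":"iphone-9"}}')

# The JSON payload starts at the first "{"; everything before it is the
# timestamp/thread/level/logger prefix.
payload = json.loads(line[line.index("{"):])
device_id = payload["attr"]["device_id"]   # member key: one device = one member
event_ms = payload["app_active"]["time"]   # startup time in epoch milliseconds
print(device_id, event_ms)  # 1FB872-9A1009 1596111888529
```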

Taildir Source Characteristics

  • Matches file names in a directory with regular expressions
  • Once data is written to a monitored file, Flume forwards it to the configured Sink
  • High reliability: events are not lost (duplicates are possible after a crash, but data is never dropped)
  • Leaves the tracked files untouched: never renames or deletes them
  • Not supported on Windows; cannot read binary files; reads text files line by line

Taildir Source Configuration

  • positionFile: Path of the checkpoint file, which records the already-read position of each file in JSON format, enabling resume after a restart
  • filegroups: Names of the file groups, separated by spaces (a Taildir source can monitor files in multiple directories simultaneously)
  • filegroups.f1: Absolute file path for each file group; the file name part may be a regular expression
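For reference, the positionFile content is a JSON array with one entry per tracked file, roughly like this (the inode and pos values are illustrative):

```json
[
  {"inode": 1312312, "pos": 10240, "file": "/opt/wzk/logs/start/app.2020-07-30.log"}
]
```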

HDFS Sink Configuration

HDFS Sink writes rolling files. The rolling strategies (with Flume defaults):

  • By time: hdfs.rollInterval, default 30 seconds
  • By file size: hdfs.rollSize, default 1024 bytes
  • By event count: hdfs.rollCount, default 10 events
  • By idle time: hdfs.idleTimeout, default 0
  • Setting any of these to 0 disables that strategy
  • minBlockReplicas defaults to the HDFS replication factor; set it to 1 so that HDFS block-replication activity does not trigger premature file rolls

Agent Configuration

a1.sources = r1
a1.sinks = k1
a1.channels = c1

# taildir source
a1.sources.r1.type = TAILDIR
a1.sources.r1.positionFile = /opt/wzk/conf/startlog_position.json
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /opt/wzk/logs/start/.*log

# memorychannel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 100000
a1.channels.c1.transactionCapacity = 2000

# hdfs sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /user/data/logs/start/%Y-%m-%d/
a1.sinks.k1.hdfs.filePrefix = startlog.
a1.sinks.k1.hdfs.fileType = DataStream
# Configure file rolling method (file size 32M)
a1.sinks.k1.hdfs.rollSize = 33554432
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.idleTimeout = 0
a1.sinks.k1.hdfs.minBlockReplicas = 1
# Number of events flushed to hdfs
a1.sinks.k1.hdfs.batchSize = 1000
# Use local time
a1.sinks.k1.hdfs.useLocalTimeStamp = true

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Starting the Agent

flume-ng agent --conf-file /opt/wzk/flume-conf/flume-log2hdfs1.conf --name a1 -Dflume.root.logger=INFO,console

Error Quick Reference

| Symptom | Root Cause | Fix |
| --- | --- | --- |
| Taildir collects nothing; no new files on HDFS | filegroups regex matches no files, or the path is wrong | Search the Flume logs for TaildirSource to see the monitored file list; verify the path and regex |
| positionFile not updating; collection intermittent | No permission on the positionFile, or its directory does not exist | Check that the positionFile JSON is created and growing; check the Flume user's permissions |
| HDFS writes fail and retry repeatedly | HDFS permissions, NameNode connectivity, or missing target path | Search the Flume logs for hdfs / Permission denied / Failed; check target directory permissions on the HDFS side |
| Files never roll; HDFS directory stays empty | All roll strategies disabled, or thresholds never reached | For low-traffic tests, set rollInterval (e.g., 60 or 300 seconds) or lower rollSize |
| Memory spikes, channel OOM, throughput jitter | Memory channel capacity too large; slow HDFS writes cause a backlog | Reduce capacity; speed up the sink/HDFS side; switch to a file channel if necessary |
| Duplicates or gaps after a crash | positionFile cleared or rolled back; log rotation incompatible with Taildir tracking | Keep the positionFile stable and persistent; avoid deleting or editing it; append-style log rotation is more robust |
| Partition date wrong (UTC/time offset) | Event timestamp source inconsistent with useLocalTimeStamp | Set hdfs.useLocalTimeStamp = true |