This is article 19 in the Big Data series. It demonstrates how to use Flume to collect Hive's runtime log in real time and write it to HDFS with time-based partitioning.

Complete illustrated version: CSDN Original | Juejin

Case Goal

Collect Hive's runtime log file to HDFS in real time, with the path partitioned by date and time (down to the minute), so the logs can later be analyzed with Hive SQL:

/flume/20240717/1430/logs-xxxxx

Components used:

  • Source: exec source, which runs tail -F to follow the log file in real time
  • Channel: memory channel, a high-throughput in-memory buffer
  • Sink: HDFS sink, which writes to HDFS with time-based partitioning

Configuration File

Create config file /opt/wzk/flume_test/flume-exec-hdfs.conf:

# Agent component declaration
a2.sources = r2
a2.sinks = k2
a2.channels = c2

# Source: real-time track Hive logs
a2.sources.r2.type = exec
a2.sources.r2.command = tail -F /tmp/root/hive.log

# Channel: memory buffer
a2.channels.c2.type = memory
a2.channels.c2.capacity = 10000
a2.channels.c2.transactionCapacity = 500

# Sink: write to HDFS, partitioned by minute
a2.sinks.k2.type = hdfs
a2.sinks.k2.hdfs.path = hdfs://h121.wzk.icu:9000/flume/%Y%m%d/%H%M
a2.sinks.k2.hdfs.filePrefix = logs-
a2.sinks.k2.hdfs.batchSize = 500
a2.sinks.k2.hdfs.rollInterval = 60
a2.sinks.k2.hdfs.rollSize = 134217700
a2.sinks.k2.hdfs.rollCount = 0
a2.sinks.k2.hdfs.minBlockReplicas = 1

# Binding
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2

Key Parameters Detailed

Time Placeholders in HDFS Path

%Y%m%d/%H%M is replaced with the actual time, e.g., 20240717/1430. Note: the sink needs a timestamp field in the Event header, so either configure a TimestampInterceptor on the Source or use the local-timestamp option.
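For illustration only, the path expansion can be mimicked with Python's strftime, since these particular escape sequences happen to match Flume's; the timestamp below is a hypothetical event-header value in milliseconds, rendered in UTC for determinism (a real cluster would use its local timezone):

```python
from datetime import datetime, timezone

# Hypothetical event-header timestamp: 2024-07-17 14:30:00 UTC, in milliseconds
ts_ms = 1721226600000

# Mimic the HDFS sink expanding hdfs.path = .../%Y%m%d/%H%M
dt = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
path = dt.strftime("/flume/%Y%m%d/%H%M")
print(path)  # /flume/20240717/1430
```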

If you don't want to depend on the Event header, you can have the sink use the local machine time instead:

a2.sinks.k2.hdfs.useLocalTimeStamp = true
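Alternatively, a timestamp interceptor on the Source stamps each Event with the current time as it is received (a minimal sketch; the interceptor name i1 is arbitrary):

```
a2.sources.r2.interceptors = i1
a2.sources.r2.interceptors.i1.type = timestamp
```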

File Rolling Strategy

The HDFS sink decides when to close the current file and open a new one based on three parameters:

  • rollInterval = 60: roll every 60 seconds (0 disables time-based rolling)
  • rollSize = 134217700: roll once the file reaches roughly 128 MB (0 disables size-based rolling)
  • rollCount = 0: do not roll by Event count

Whichever condition is met first triggers a roll. This example effectively rolls on time alone.
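The decision logic can be sketched as follows (an illustration of the rules above, not Flume's actual source code; parameter defaults mirror this example's configuration):

```python
# Sketch of the HDFS sink's three roll triggers: a value of 0 disables
# that trigger, and any enabled trigger that fires causes a roll.
def should_roll(elapsed_s, bytes_written, event_count,
                roll_interval=60, roll_size=134217700, roll_count=0):
    if roll_interval > 0 and elapsed_s >= roll_interval:
        return True   # time-based roll
    if roll_size > 0 and bytes_written >= roll_size:
        return True   # size-based roll
    if roll_count > 0 and event_count >= roll_count:
        return True   # count-based roll
    return False

print(should_roll(61, 1024, 10))         # True  (interval exceeded)
print(should_roll(30, 200_000_000, 10))  # True  (size exceeded)
print(should_roll(30, 1024, 10))         # False (no trigger fired)
```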

Replica Count Control

minBlockReplicas = 1 sets the minimum number of replicas required per HDFS block for the sink's writes (by default this follows the HDFS cluster configuration). Setting it to 1 avoids write stalls and unexpected file rolls caused by block replication, which is convenient in a test environment.

batchSize

The number of Events the sink takes from the Channel and writes to HDFS per flush; a larger value means higher throughput but also higher memory usage.
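A practical sizing note (an assumption based on Flume's transactional channel model): hdfs.batchSize should not exceed the channel's transactionCapacity, because each batch is taken from the channel in a single transaction. The configuration in this example keeps them equal:

```
# each sink batch (500 Events) fits within one channel transaction (500)
a2.channels.c2.transactionCapacity = 500
a2.sinks.k2.hdfs.batchSize = 500
```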

Start Collection Task

First confirm that the Hive log file exists (it is generated automatically once Hive executes an HQL statement):

ls /tmp/root/hive.log

Start Flume Agent:

$FLUME_HOME/bin/flume-ng agent \
  --conf $FLUME_HOME/conf \
  --name a2 \
  --conf-file /opt/wzk/flume_test/flume-exec-hdfs.conf \
  -Dflume.root.logger=INFO,console

Verify Results

Execute a few SQL statements in Hive to generate logs, then check files on HDFS:

hdfs dfs -ls /flume/
hdfs dfs -ls /flume/20240717/

After a roll interval (60 seconds), the in-progress .tmp suffix is dropped and the file is renamed to its final name, indicating the write is complete:

/flume/20240717/1430/logs-h121.wzk.icu-1721214600000.1721214660000

Use hdfs dfs -cat to view the file content; it should match the Hive log.