This is article 17 in the Big Data series. It introduces the architecture and core concepts of Apache Flume, a distributed log collection system.

What is Flume

Apache Flume is a distributed, highly reliable log collection system designed for real-time data transmission in big data scenarios. It can collect data from various sources (log files, network ports, message queues, etc.) and aggregate and transmit it to storage systems such as HDFS, HBase, and Kafka.

Typical use case: Server logs → Flume → HDFS, then analyzed offline by Hive/MapReduce.

Core Components

Each running instance in Flume is called an Agent, which is an independent JVM process. Each Agent consists of three core components:

| Component | Role | Common Types |
|-----------|------|--------------|
| Source | Receives data from external sources | netcat, exec, spooldir, taildir, kafka, http |
| Channel | Buffers data between Source and Sink | memory, file |
| Sink | Writes data to the target system | hdfs, hbase, kafka, logger, avro |

Event is the smallest data transmission unit in Flume, consisting of two parts:

  • headers: Metadata in key-value pair form
  • body: Actual content in byte array form
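To make the three components concrete, here is a minimal single-Agent configuration sketch (the agent name `a1`, component names, and the netcat port are illustrative) that reads text from a netcat Source, buffers it in a Memory Channel, and prints Events through a logger Sink:

```properties
# Name the components of agent a1
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: listen on a TCP port for newline-delimited text
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: in-memory buffer between Source and Sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Sink: print Events to the log, useful for debugging
a1.sinks.k1.type = logger

# Wire the components together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

An Agent loads such a file at startup, e.g. `flume-ng agent --conf $FLUME_HOME/conf --conf-file netcat-logger.conf --name a1 -Dflume.root.logger=INFO,console` (the config filename here is illustrative).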

Channel Comparison

| Feature | Memory Channel | File Channel |
|---------|----------------|--------------|
| Storage location | JVM heap memory | Disk files |
| Performance | High | Relatively low |
| Reliability | Data lost on process crash | Persistent, recoverable |
| Applicable scenario | High-throughput scenarios that tolerate small data loss | Production scenarios requiring no data loss |
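For pipelines that cannot lose data, the File Channel persists Events to disk; it is configured with a checkpoint directory and one or more data directories (the paths below are illustrative):

```properties
a1.channels.c1.type = file
# Directory where the channel stores its checkpoint metadata
a1.channels.c1.checkpointDir = /opt/wzk/flume/checkpoint
# Comma-separated list of directories for the event data logs
a1.channels.c1.dataDirs = /opt/wzk/flume/data
```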

Data Processing Extensions

Beyond the basic three components, Flume also supports:

  • Interceptor: Pre-processes Events after Source collection and before writing to Channel, such as filtering, modifying headers, adding timestamps, etc.
  • Channel Selector: Decides which Channels to route Events to (Replicating / Multiplexing)
  • Sink Processor: Manages load balancing or failover strategies for multiple Sinks
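As a small example of an Interceptor, Flume's built-in timestamp interceptor adds a `timestamp` key to each Event's headers after the Source collects it, which a downstream HDFS Sink can use for time-based directory escaping (a sketch; `r1` and `i1` are illustrative names):

```properties
# Attach an interceptor chain to source r1
a1.sources.r1.interceptors = i1
# Built-in interceptor that adds a "timestamp" header to every Event
a1.sources.r1.interceptors.i1.type = timestamp
```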

Data Flow Topologies

Serial Mode

Multiple Flume Agents are chained in series: the upstream Agent’s Sink connects to the downstream Agent’s Source (via the Avro protocol), passing data along stage by stage.

Server A (exec source) → Avro Sink → Avro Source → HDFS Sink

Advantages: decouples collection from storage. Disadvantages: performance degrades as the chain grows, and each intermediate Agent is a single point of failure.
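A two-Agent chain is wired by pointing the upstream Avro Sink at the host and port where the downstream Avro Source listens (hostname and port below are illustrative):

```properties
# --- Upstream agent a1 (collection node) ---
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = collector-host
a1.sinks.k1.port = 4141

# --- Downstream agent a2 (aggregation node) ---
a2.sources.r1.type = avro
a2.sources.r1.bind = 0.0.0.0
a2.sources.r1.port = 4141
```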

Replication Mode

One Source writes data to multiple Channels simultaneously, with each Channel feeding a different Sink target, so the same data is distributed along multiple paths:

Source → Channel Selector (replicating) → Channel1 → HDFS Sink
                                        → Channel2 → Kafka Sink
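The fan-out above is expressed by listing multiple Channels on the Source and setting the replicating selector (which is in fact the default); a sketch with illustrative names, Sink details omitted:

```properties
# One Source fans out identical copies of each Event to two Channels
a1.sources.r1.selector.type = replicating
a1.sources.r1.channels = c1 c2

# Each Channel feeds a different Sink (sink type/config omitted)
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
```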

Load Balancing Mode

Multiple Sinks form a Sink Group, and the Sink Processor distributes Events among them using a round-robin or random strategy, improving write throughput.
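A Sink Group with a load-balancing processor can be sketched as follows (sink names `k1`/`k2` are illustrative and must be defined elsewhere in the same file):

```properties
# Group two Sinks and balance Events across them
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
# Selector strategy: round_robin (default) or random
a1.sinkgroups.g1.processor.selector = round_robin
# Temporarily back off from Sinks that fail, retrying later
a1.sinkgroups.g1.processor.backoff = true
```

Setting `processor.type = failover` instead (with per-Sink priorities) turns the same group into a hot-standby configuration.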

Installation and Configuration

Using Flume 1.9.0 as an example, extract the archive and configure the environment variables:

export FLUME_HOME=/opt/wzk/flume
export PATH=$PATH:$FLUME_HOME/bin

Copy and edit the environment configuration file; JAVA_HOME must be declared explicitly, otherwise startup will fail with an error:

cp $FLUME_HOME/conf/flume-env.sh.template $FLUME_HOME/conf/flume-env.sh
# Add in flume-env.sh:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

Verify installation:

flume-ng version