This is article 17 in the Big Data series. It introduces the architecture and core concepts of Apache Flume, a distributed log collection system.

What is Flume

Apache Flume is a distributed, highly reliable log collection system designed for real-time data transmission in big data scenarios. It can collect data from various sources (log files, network ports, message queues, etc.) and aggregate and transmit it to storage systems such as HDFS, HBase, and Kafka.

Typical use case: Server logs → Flume → HDFS, then analyzed offline by Hive/MapReduce.

Core Components

Each running instance in Flume is called an Agent, which is an independent JVM process. Each Agent consists of three core components:

| Component | Role | Common Types |
|-----------|------|--------------|
| Source | Receives data from external sources | netcat, exec, spooldir, taildir, kafka, http |
| Channel | Buffers data between Source and Sink | memory, file |
| Sink | Writes data to the target system | hdfs, hbase, kafka, logger, avro |

Event is the smallest data transmission unit in Flume, consisting of two parts:

  • headers: Metadata in key-value pair form
  • body: Actual content in byte array form
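To make the three components concrete, here is a minimal single-Agent configuration sketch (the agent name `a1`, component names, and the netcat port are illustrative) that reads text from a netcat Source, buffers it in a Memory Channel, and prints Events through a logger Sink:

```properties
# Name the components of agent a1
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: listen on a TCP port for newline-delimited text
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: in-memory buffer between Source and Sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Sink: print Events to the log, useful for debugging
a1.sinks.k1.type = logger

# Wire the components together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

An Agent loads such a file at startup, e.g. `flume-ng agent --conf $FLUME_HOME/conf --conf-file netcat-logger.conf --name a1 -Dflume.root.logger=INFO,console` (the config filename here is illustrative).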

Channel Comparison

| Feature | Memory Channel | File Channel |
|---------|----------------|--------------|
| Storage location | JVM heap memory | Disk files |
| Performance | High | Relatively low |
| Reliability | Data lost on process crash | Persistent, recoverable |
| Applicable scenario | High-throughput scenarios that tolerate small data loss | Production scenarios requiring no data loss |
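For pipelines that cannot lose data, the File Channel persists Events to disk; it is configured with a checkpoint directory and one or more data directories (the paths below are illustrative):

```properties
a1.channels.c1.type = file
# Directory where the channel stores its checkpoint metadata
a1.channels.c1.checkpointDir = /opt/wzk/flume/checkpoint
# Comma-separated list of directories for the event data logs
a1.channels.c1.dataDirs = /opt/wzk/flume/data
```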

Data Processing Extensions

Beyond the basic three components, Flume also supports:

  • Interceptor: Pre-processes Events after Source collection and before writing to Channel, such as filtering, modifying headers, adding timestamps, etc.
  • Channel Selector: Decides which Channels to route Events to (Replicating / Multiplexing)
  • Sink Processor: Manages load balancing or failover strategies for multiple Sinks
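As a small example of an Interceptor, Flume's built-in timestamp interceptor adds a `timestamp` key to each Event's headers after the Source collects it, which a downstream HDFS Sink can use for time-based directory escaping (a sketch; `r1` and `i1` are illustrative names):

```properties
# Attach an interceptor chain to source r1
a1.sources.r1.interceptors = i1
# Built-in interceptor that adds a "timestamp" header to every Event
a1.sources.r1.interceptors.i1.type = timestamp
```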

Data Flow Topologies

Serial Mode

Multiple Flume Agents are chained in series: the upstream Agent’s Sink connects to the downstream Agent’s Source (via the Avro protocol), passing data along stage by stage.

Server A (exec source) → Avro Sink → Avro Source → HDFS Sink

Advantages: decouples collection from storage. Disadvantages: performance degrades as the chain grows, and each intermediate Agent is a single point of failure.
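A two-Agent chain is wired by pointing the upstream Avro Sink at the host and port where the downstream Avro Source listens (hostname and port below are illustrative):

```properties
# --- Upstream agent a1 (collection node) ---
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = collector-host
a1.sinks.k1.port = 4141

# --- Downstream agent a2 (aggregation node) ---
a2.sources.r1.type = avro
a2.sources.r1.bind = 0.0.0.0
a2.sources.r1.port = 4141
```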

Replication Mode

One Source writes data to multiple Channels simultaneously, with each Channel feeding a different Sink target, so the same data is distributed along multiple paths:

Source → Channel Selector (replicating) → Channel1 → HDFS Sink
                                        → Channel2 → Kafka Sink
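The fan-out above is expressed by listing multiple Channels on the Source and setting the replicating selector (which is in fact the default); a sketch with illustrative names, Sink details omitted:

```properties
# One Source fans out identical copies of each Event to two Channels
a1.sources.r1.selector.type = replicating
a1.sources.r1.channels = c1 c2

# Each Channel feeds a different Sink (sink type/config omitted)
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
```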

Load Balancing Mode

Multiple Sinks form a Sink Group, and the Sink Processor distributes Events among them using a round-robin or random strategy, improving write throughput.
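A Sink Group with a load-balancing processor can be sketched as follows (sink names `k1`/`k2` are illustrative and must be defined elsewhere in the same file):

```properties
# Group two Sinks and balance Events across them
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
# Selector strategy: round_robin (default) or random
a1.sinkgroups.g1.processor.selector = round_robin
# Temporarily back off from Sinks that fail, retrying later
a1.sinkgroups.g1.processor.backoff = true
```

Setting `processor.type = failover` instead (with per-Sink priorities) turns the same group into a hot-standby configuration.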

Installation and Configuration

Using Flume 1.9.0 as an example, extract the archive and configure the environment variables:

export FLUME_HOME=/opt/wzk/flume
export PATH=$PATH:$FLUME_HOME/bin

Copy and edit the environment configuration file; JAVA_HOME must be declared explicitly, otherwise startup will fail with an error:

cp $FLUME_HOME/conf/flume-env.sh.template $FLUME_HOME/conf/flume-env.sh
# Add in flume-env.sh:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

Verify installation:

flume-ng version