This is article 17 in the Big Data series. It introduces the architecture and core concepts of Apache Flume, a distributed log collection system.
Complete illustrated version: CSDN Original | Juejin
What is Flume
Apache Flume is a distributed, highly reliable log collection system designed for real-time data transmission in big data scenarios. It can collect data from various sources (log files, network ports, message queues, etc.), then aggregate and transmit it to storage systems such as HDFS, HBase, and Kafka.
Typical use case: Server logs → Flume → HDFS, then analyzed offline by Hive/MapReduce.
Core Components
Each running instance in Flume is called an Agent, which is an independent JVM process. Each Agent consists of three core components:
| Component | Role | Common Types |
|---|---|---|
| Source | Receives data from external sources | netcat, exec, spooldir, taildir, kafka, http |
| Channel | Buffers data between Source and Sink | memory, file |
| Sink | Writes data to target system | hdfs, hbase, kafka, logger, avro |
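As a sketch of how the three components are wired together, here is a minimal configuration for an agent named a1 (the names a1, r1, c1, k1 are illustrative) that reads lines from a netcat source, buffers them in a memory channel, and logs them to the console:

```properties
# Name the components of agent a1
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: listen on a local TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: in-memory buffer
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Sink: print events at INFO level
a1.sinks.k1.type = logger

# Bind source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

Note the asymmetry in the binding keys: a Source can write to multiple Channels (`channels`, plural), while a Sink reads from exactly one (`channel`, singular).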
Event is the smallest data transmission unit in Flume, consisting of two parts:
- headers: metadata in key-value pair form
- body: the actual content as a byte array
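For a concrete picture, when an Event reaches a logger sink it is printed roughly in this form (the body appears both as hex bytes and as text; exact formatting may vary by version):

```
Event: { headers:{timestamp=1700000000000} body: 48 65 6C 6C 6F    Hello }
```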
Channel Comparison
| Feature | Memory Channel | File Channel |
|---|---|---|
| Storage Location | JVM heap memory | Disk files |
| Performance | High | Relatively low |
| Reliability | Data lost on process crash | Persistent, recoverable |
| Applicable Scenario | High throughput scenarios allowing small data loss | Production scenarios requiring no data loss |
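A File Channel is configured by pointing it at a checkpoint directory and one or more data directories on disk (the paths below are illustrative placeholders):

```properties
a1.channels.c1.type = file
# Where the channel's checkpoint state is kept
a1.channels.c1.checkpointDir = /opt/wzk/flume/checkpoint
# Where event data files are written (comma-separated list allowed)
a1.channels.c1.dataDirs = /opt/wzk/flume/data
a1.channels.c1.capacity = 1000000
a1.channels.c1.transactionCapacity = 10000
```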
Data Processing Extensions
Beyond the basic three components, Flume also supports:
- Interceptor: Pre-processes Events after Source collection and before writing to Channel, such as filtering, modifying headers, adding timestamps, etc.
- Channel Selector: Decides which Channels to route Events to (Replicating / Multiplexing)
- Sink Processor: Manages load balancing or failover strategies for multiple Sinks
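As a sketch, a timestamp Interceptor and a multiplexing Channel Selector might be configured like this (the header name `type` and the mapping values are illustrative):

```properties
# Interceptor: add a timestamp header to every Event
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp

# Channel Selector: route Events by the value of the "type" header
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = type
a1.sources.r1.selector.mapping.access = c1
a1.sources.r1.selector.mapping.error = c2
a1.sources.r1.selector.default = c1
```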
Data Flow Topologies
Serial Mode
Multiple Flume Agents are chained together: the upstream Agent's Sink connects to the downstream Agent's Source (via the Avro protocol), relaying data hop by hop.
Server A (exec source) → Avro Sink → Avro Source → HDFS Sink
Advantages: decouples collection from storage. Disadvantages: performance degrades as the chain grows, and each hop is a potential single point of failure.
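The two-agent chain above can be sketched as follows: agent a1 on Server A forwards Events over Avro to agent a2 on a collector host (hostnames, ports, and paths are illustrative; channel definitions are omitted for brevity):

```properties
# Agent a1 (Server A): collect via exec, forward via Avro
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app.log
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = collector-host
a1.sinks.k1.port = 4141

# Agent a2 (collector): receive via Avro, write to HDFS
a2.sources.r1.type = avro
a2.sources.r1.bind = 0.0.0.0
a2.sources.r1.port = 4141
a2.sinks.k1.type = hdfs
a2.sinks.k1.hdfs.path = hdfs://namenode:9000/flume/logs/%Y%m%d
a2.sinks.k1.hdfs.useLocalTimeStamp = true
```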
Replication Mode
One Source writes data to multiple Channels simultaneously, with each Channel feeding a different Sink, so the same data is distributed along multiple paths:
Source → Channel Selector (replicating) → Channel1 → HDFS Sink
→ Channel2 → Kafka Sink
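The replication topology above might be configured like this (Kafka addresses and topic are illustrative; channel definitions are omitted):

```properties
# Replicate every Event into both channels
a1.sources.r1.selector.type = replicating
a1.sources.r1.channels = c1 c2

# Path 1: HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1

# Path 2: Kafka
a1.sinks.k2.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k2.kafka.bootstrap.servers = kafka-host:9092
a1.sinks.k2.kafka.topic = flume-logs
a1.sinks.k2.channel = c2
```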
Load Balancing Mode
Multiple Sinks form a Sink Group, and a Sink Processor distributes Events among them using a round-robin or random strategy, improving write throughput.
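A load-balancing Sink Group over two sinks k1 and k2 can be sketched as follows:

```properties
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.selector = round_robin
# Temporarily skip sinks that have recently failed
a1.sinkgroups.g1.processor.backoff = true
```

For a failover strategy instead, set `processor.type = failover` and assign each sink a priority via `processor.priority.<sinkName>`; Events then go to the highest-priority sink that is alive.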
Installation and Configuration
Using Flume 1.9.0 as an example, extract the archive and configure the environment variables:
export FLUME_HOME=/opt/wzk/flume
export PATH=$PATH:$FLUME_HOME/bin
Copy and edit the environment configuration file; JAVA_HOME must be declared explicitly, or startup will fail:
cp $FLUME_HOME/conf/flume-env.sh.template $FLUME_HOME/conf/flume-env.sh
# Add in flume-env.sh:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
Verify installation:
flume-ng version
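Once installed, an agent can be started with a command along these lines (`example.conf` and the agent name `a1` are placeholders for your own configuration file and agent):

```shell
flume-ng agent \
  --conf $FLUME_HOME/conf \
  --conf-file example.conf \
  --name a1 \
  -Dflume.root.logger=INFO,console
```

The `--name` value must match the agent name used as the prefix in the configuration file, and the `-Dflume.root.logger` override sends log output to the console, which is convenient when testing with the logger sink.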