This is article 90 in the Big Data series and the starting point of the Flink series. It introduces Apache Flink's design philosophy, technical features, and architecture components.
What is Apache Flink
Apache Flink is an open-source distributed stream processing framework, specifically designed for high-performance, low-latency processing of both bounded (batch) and unbounded (stream) data.
Flink's core design philosophy is unified stream-batch processing: batch processing is treated as a special case of stream processing (a bounded stream), not as a separate computing model. The same API and runtime can therefore process both real-time streaming data and offline batch data.
Development History
| Time | Event |
|---|---|
| 2008 | Stratosphere research project started at TU Berlin |
| April 2014 | Donated to Apache Software Foundation |
| December 2014 | Became an Apache top-level project, officially named Flink (German for "quick, nimble") |
| Present | Large-scale production use by Alibaba, Uber, Netflix, etc. |
Alibaba is one of Flink's most important enterprise contributors; its internally maintained fork, Blink, has been merged back into the community mainline.
Core Technical Features
1. Exactly-Once State Consistency
Flink implements fault tolerance with lightweight distributed snapshots (an asynchronous-barrier variant of the Chandy-Lamport algorithm). On node failure, the job rolls back precisely to the state of the last completed Checkpoint, so every record affects state exactly once, even though records may be reprocessed during recovery.
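The barrier-alignment idea behind these snapshots can be sketched in a few lines. The following is a toy simulation, not Flink code: the channel layout, the `BARRIER` marker, and `run_operator` are all hypothetical, and real Flink performs alignment inside its network stack rather than in user code.

```python
def run_operator(channels, take_snapshot):
    """Simulate one operator with multiple input channels.

    channels: dict name -> list of records, where the string "BARRIER"
    marks a checkpoint barrier. When a barrier arrives on one channel,
    that channel is blocked until the barrier has arrived on every
    channel (alignment); only then is the operator state snapshotted.
    Toy state: a running sum of all numeric records consumed so far.
    """
    state = 0
    iters = {ch: iter(recs) for ch, recs in channels.items()}
    blocked = {ch: False for ch in channels}
    finished = set()
    snapshots = []
    while len(finished) < len(channels):
        progressed = False
        for ch, it in iters.items():
            if ch in finished or blocked[ch]:
                continue                      # blocked channels wait for alignment
            try:
                rec = next(it)
            except StopIteration:
                finished.add(ch)
                continue
            progressed = True
            if rec == "BARRIER":
                blocked[ch] = True
                if all(blocked[c] or c in finished for c in channels):
                    snapshots.append(take_snapshot(state))   # aligned: snapshot
                    blocked = {c: False for c in channels}   # release channels
            else:
                state += rec
        if not progressed:
            break                             # avoid spinning if a channel never aligns
    return state, snapshots

channels = {"a": [1, 2, "BARRIER", 3], "b": [10, "BARRIER", 20]}
final_state, snapshots = run_operator(channels, take_snapshot=lambda s: s)
print(final_state, snapshots)   # 36 [13]
```

The snapshot value 13 covers exactly the records before the barrier on each channel (1 + 2 + 10); records after the barrier (3 and 20) only affect state after the snapshot, which is what makes recovery exact.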
2. Event Time Processing
Flink natively supports Event Time semantics, handling out-of-order data through the Watermark mechanism and processing records by the time they were generated rather than the time they happen to arrive at the system.
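A minimal sketch of how a bounded-out-of-orderness watermark is derived (the function name and event tuples are illustrative, not Flink's API): the watermark trails the maximum event time seen so far by a fixed allowed delay.

```python
def watermarks(events, max_out_of_orderness):
    """events: iterable of (event_time, payload) pairs.

    Yields the watermark emitted after each event: the maximum event
    time seen so far, minus the allowed out-of-orderness. A window
    ending at time T may fire once the watermark reaches T.
    """
    max_ts = float("-inf")
    for ts, _payload in events:
        max_ts = max(max_ts, ts)          # watermark never moves backwards
        yield max_ts - max_out_of_orderness

# The third event (999) arrives out of order but is still "on time",
# because the watermark (998) has not yet passed its timestamp.
events = [(1000, "a"), (1003, "b"), (999, "late?"), (1007, "c")]
print(list(watermarks(events, 5)))   # [995, 998, 998, 1002]
```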
3. Layered API Architecture
From highest to lowest level:
- SQL / Table API: declarative, suitable for analysts
- DataStream API: the core stream processing API
- ProcessFunction: the lowest level, with complete control over state and time
4. Memory Management
Flink manages JVM memory itself (via its own TypeSerializer stack and fixed-size binary MemorySegments), keeping data in serialized binary form rather than as Java objects. This avoids the GC pressure of massive object allocation and yields notably better memory efficiency in large-data scenarios than object-based processing.
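The idea of packing records into fixed-size binary segments instead of keeping them as objects can be illustrated with Python's `struct` module. This is a conceptual analogy only: the record layout below is made up, though the 32 KiB size matches Flink's default memory segment size.

```python
import struct

# One hypothetical record: (long id, double value), little-endian, 16 bytes.
RECORD = struct.Struct("<qd")
SEGMENT_SIZE = 32 * 1024        # Flink's default segment size is 32 KiB

# Records live as bytes inside a pre-allocated segment, not as objects,
# so the garbage collector never sees millions of small record objects.
segment = bytearray(SEGMENT_SIZE)
records = [(1, 3.5), (2, -1.0), (3, 7.25)]
for i, rec in enumerate(records):
    RECORD.pack_into(segment, i * RECORD.size, *rec)

# Records are deserialized on demand when accessed:
print(RECORD.unpack_from(segment, 1 * RECORD.size))   # (2, -1.0)
```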
5. Elastic Scalability
Flink scales horizontally to thousands of nodes and can process PB-scale data while maintaining millisecond-level latency.
Processing Models: Bounded vs Unbounded
Unbounded Streams:
- Data continuously generated, no natural endpoint
- Must be processed continuously as data arrives; jobs run 24×7
- Typical sources: Kafka, message queues, sensor data
Bounded Streams:
- Have clear start and end
- Can wait for all data to arrive before processing
- Equivalent to traditional batch processing, terminates after completion
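The unified model falls out naturally when the same processing function consumes either a finite collection or an endless generator, as in this illustrative sketch (not Flink code):

```python
import itertools

def running_sum(stream):
    """One processing function for both bounded and unbounded input:
    it consumes records one at a time and emits a running sum,
    never assuming the input has an end."""
    total = 0
    for x in stream:
        total += x
        yield total

bounded = [1, 2, 3]                 # bounded stream: terminates by itself
unbounded = itertools.count(1)      # unbounded stream: never ends on its own

bounded_result = list(running_sum(bounded))
unbounded_head = list(itertools.islice(running_sum(unbounded), 3))
print(bounded_result, unbounded_head)   # [1, 3, 6] [1, 3, 6]
```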
Architecture Components
JobManager
The cluster’s “brain”, responsible for:
- Receiving and parsing job graphs (JobGraph → ExecutionGraph)
- Task scheduling and resource allocation
- Checkpoint coordination and fault recovery
- Monitoring TaskManager heartbeats
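The JobGraph → ExecutionGraph step can be sketched as expanding each logical operator into its parallel subtasks. The list-based representation and names here are hypothetical simplifications of what the JobManager actually builds:

```python
def to_execution_graph(job_graph):
    """job_graph: list of (operator_name, parallelism) pairs in
    topological order, standing in for the logical JobGraph.

    Returns the flat list of parallel subtasks (the ExecutionGraph's
    vertices) that the JobManager would schedule onto task slots.
    """
    subtasks = []
    for name, parallelism in job_graph:
        for i in range(parallelism):
            # Each logical vertex with parallelism p becomes p subtasks.
            subtasks.append(f"{name}[{i + 1}/{parallelism}]")
    return subtasks

job_graph = [("Source", 2), ("Map", 2), ("Sink", 1)]
print(to_execution_graph(job_graph))
# ['Source[1/2]', 'Source[2/2]', 'Map[1/2]', 'Map[2/2]', 'Sink[1/1]']
```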
TaskManager
The “worker” executing tasks, responsible for:
- Running assigned Tasks (SubTasks)
- Managing local state
- Handling network data transfer between tasks
- Reporting status to JobManager
Dispatcher
The job submission entry point. It exposes a REST API, starts a JobMaster (the per-job JobManager component) for each submitted job, and also serves the Flink Web UI.
ResourceManager
Interfaces with the underlying cluster management system (YARN, Kubernetes, or Standalone) and is responsible for requesting, allocating, and releasing compute resources in units of Task Slots.
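Slot bookkeeping can be sketched as a simple pool. This is a hypothetical simplification: the real ResourceManager also talks to YARN or Kubernetes to provision new TaskManagers when no free slot is available.

```python
class SlotPool:
    """Toy model of ResourceManager slot accounting: slots are
    identified as (task_manager, slot_index) pairs."""

    def __init__(self, slots_per_tm, task_managers):
        self.free = [(tm, i) for tm in task_managers
                     for i in range(slots_per_tm)]
        self.allocated = {}   # slot -> job id

    def request(self, job_id):
        """Hand out a free slot, or None (which, on YARN/Kubernetes,
        would trigger requesting a new TaskManager)."""
        if not self.free:
            return None
        slot = self.free.pop()
        self.allocated[slot] = job_id
        return slot

    def release(self, slot):
        """Return a slot to the pool when its task finishes."""
        del self.allocated[slot]
        self.free.append(slot)

pool = SlotPool(slots_per_tm=2, task_managers=["tm-1", "tm-2"])
s1 = pool.request("job-A")
s2 = pool.request("job-A")
pool.release(s1)
print(len(pool.free), pool.allocated)
```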
State Backend
State storage abstraction layer. The classic configuration offers three implementations (since Flink 1.13 these are superseded by HashMapStateBackend and EmbeddedRocksDBStateBackend, with the same trade-offs):
| Backend | Storage Location | Use Case |
|---|---|---|
| MemoryStateBackend | JVM Heap | Development/testing |
| FsStateBackend | File system + Heap | Medium state scale |
| RocksDBStateBackend | RocksDB (local disk) | Ultra-large state, production recommended |
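Choosing a backend is a configuration change. For example, in `flink-conf.yaml` (these are real configuration keys; the checkpoint path is a placeholder):

```yaml
# Select the RocksDB backend for very large state.
state.backend: rocksdb
# Durable storage for completed checkpoints (placeholder path).
state.checkpoints.dir: hdfs:///flink/checkpoints
# Upload only changed SST files at each checkpoint.
state.backend.incremental: true
```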
Flink vs Spark Streaming
| Dimension | Flink | Spark Streaming |
|---|---|---|
| Processing Model | Native stream processing | Micro-batch |
| Latency | Millisecond-level | Second-level (batch interval) |
| State Management | Native built-in | Depends on RDD checkpoint |
| Event Time | Native support | Supported but complex |
| Batch Processing | Unified API | Separate Spark Core API |
| Learning Curve | Steeper | Unified with Spark ecosystem |
Use Cases
Typical scenarios for choosing Flink:
- Real-time risk control requiring millisecond-level end-to-end latency
- Complex Event Processing (CEP) and multi-stream joins
- Large-scale stateful stream computing (e.g., real-time recommendations, real-time features)
- Financial scenarios requiring strict Exactly-Once semantics
Subsequent articles will progressively introduce Flink DataStream API programming, window operations, state management, and Checkpoint configuration.