This is article 90 in the Big Data series and the starting point of the Flink series. It introduces Apache Flink's design philosophy, technical features, and architecture components.
What is Apache Flink
Apache Flink is an open-source distributed stream processing framework, specifically designed for high-performance, low-latency processing of both bounded (batch) and unbounded (stream) data.
Flink's core design philosophy is unified stream-batch processing: batch processing is treated as a special case of stream processing (a bounded stream), not as a separate computing model. The same API and runtime can therefore process both real-time streaming data and offline batch data.
Development History
| Time | Event |
|---|---|
| 2008 | Stratosphere research project started at TU Berlin |
| April 2014 | Donated to Apache Software Foundation |
| December 2014 | Became an Apache top-level project, officially named Flink (German for "quick, nimble") |
| Present | Large-scale production use by Alibaba, Uber, Netflix, etc. |
Alibaba is one of Flink's most important enterprise contributors; its internally maintained fork, Blink, has been merged back into the community mainline.
Core Technical Features
1. Exactly-Once State Consistency
Flink implements fault tolerance with lightweight distributed snapshots (an asynchronous-barrier variant of the Chandy-Lamport algorithm). On node failure, the job rolls back precisely to the state of the last completed Checkpoint, so every record affects state exactly once, even though records may be reprocessed during recovery.
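The barrier-alignment idea behind these snapshots can be sketched in a few lines. The following is a toy simulation, not Flink code: the channel layout, the `BARRIER` marker, and `run_operator` are all hypothetical, and real Flink performs alignment inside its network stack rather than in user code.

```python
def run_operator(channels, take_snapshot):
    """Simulate one operator with multiple input channels.

    channels: dict name -> list of records, where the string "BARRIER"
    marks a checkpoint barrier. When a barrier arrives on one channel,
    that channel is blocked until the barrier has arrived on every
    channel (alignment); only then is the operator state snapshotted.
    Toy state: a running sum of all numeric records consumed so far.
    """
    state = 0
    iters = {ch: iter(recs) for ch, recs in channels.items()}
    blocked = {ch: False for ch in channels}
    finished = set()
    snapshots = []
    while len(finished) < len(channels):
        progressed = False
        for ch, it in iters.items():
            if ch in finished or blocked[ch]:
                continue                      # blocked channels wait for alignment
            try:
                rec = next(it)
            except StopIteration:
                finished.add(ch)
                continue
            progressed = True
            if rec == "BARRIER":
                blocked[ch] = True
                if all(blocked[c] or c in finished for c in channels):
                    snapshots.append(take_snapshot(state))   # aligned: snapshot
                    blocked = {c: False for c in channels}   # release channels
            else:
                state += rec
        if not progressed:
            break                             # avoid spinning if a channel never aligns
    return state, snapshots

channels = {"a": [1, 2, "BARRIER", 3], "b": [10, "BARRIER", 20]}
final_state, snapshots = run_operator(channels, take_snapshot=lambda s: s)
print(final_state, snapshots)   # 36 [13]
```

The snapshot value 13 covers exactly the records before the barrier on each channel (1 + 2 + 10); records after the barrier (3 and 20) only affect state after the snapshot, which is what makes recovery exact.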
2. Event Time Processing
Flink natively supports Event Time semantics, handling out-of-order data through the Watermark mechanism and processing records by the time they were generated rather than the time they happen to arrive at the system.
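A minimal sketch of how a bounded-out-of-orderness watermark is derived (the function name and event tuples are illustrative, not Flink's API): the watermark trails the maximum event time seen so far by a fixed allowed delay.

```python
def watermarks(events, max_out_of_orderness):
    """events: iterable of (event_time, payload) pairs.

    Yields the watermark emitted after each event: the maximum event
    time seen so far, minus the allowed out-of-orderness. A window
    ending at time T may fire once the watermark reaches T.
    """
    max_ts = float("-inf")
    for ts, _payload in events:
        max_ts = max(max_ts, ts)          # watermark never moves backwards
        yield max_ts - max_out_of_orderness

# The third event (999) arrives out of order but is still "on time",
# because the watermark (998) has not yet passed its timestamp.
events = [(1000, "a"), (1003, "b"), (999, "late?"), (1007, "c")]
print(list(watermarks(events, 5)))   # [995, 998, 998, 1002]
```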
3. Layered API Architecture
From highest to lowest level:
- SQL / Table API: declarative, suitable for analysts
- DataStream API: the core stream processing API
- ProcessFunction: the lowest level, with complete control over state and time
4. Memory Management
Flink manages JVM memory itself (via its own TypeSerializer stack and fixed-size binary MemorySegments), keeping data in serialized binary form rather than as Java objects. This avoids the GC pressure of massive object allocation and yields notably better memory efficiency in large-data scenarios than object-based processing.
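The idea of packing records into fixed-size binary segments instead of keeping them as objects can be illustrated with Python's `struct` module. This is a conceptual analogy only: the record layout below is made up, though the 32 KiB size matches Flink's default memory segment size.

```python
import struct

# One hypothetical record: (long id, double value), little-endian, 16 bytes.
RECORD = struct.Struct("<qd")
SEGMENT_SIZE = 32 * 1024        # Flink's default segment size is 32 KiB

# Records live as bytes inside a pre-allocated segment, not as objects,
# so the garbage collector never sees millions of small record objects.
segment = bytearray(SEGMENT_SIZE)
records = [(1, 3.5), (2, -1.0), (3, 7.25)]
for i, rec in enumerate(records):
    RECORD.pack_into(segment, i * RECORD.size, *rec)

# Records are deserialized on demand when accessed:
print(RECORD.unpack_from(segment, 1 * RECORD.size))   # (2, -1.0)
```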
5. Elastic Scalability
Flink scales horizontally to thousands of nodes and can process PB-scale data while maintaining millisecond-level latency.
Processing Models: Bounded vs Unbounded
Unbounded Streams:
- Data continuously generated, no natural endpoint
- Must be processed continuously as data arrives; jobs run 24×7
- Typical sources: Kafka, message queues, sensor data
Bounded Streams:
- Have clear start and end
- Can wait for all data to arrive before processing
- Equivalent to traditional batch processing, terminates after completion
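The unified model falls out naturally when the same processing function consumes either a finite collection or an endless generator, as in this illustrative sketch (not Flink code):

```python
import itertools

def running_sum(stream):
    """One processing function for both bounded and unbounded input:
    it consumes records one at a time and emits a running sum,
    never assuming the input has an end."""
    total = 0
    for x in stream:
        total += x
        yield total

bounded = [1, 2, 3]                 # bounded stream: terminates by itself
unbounded = itertools.count(1)      # unbounded stream: never ends on its own

bounded_result = list(running_sum(bounded))
unbounded_head = list(itertools.islice(running_sum(unbounded), 3))
print(bounded_result, unbounded_head)   # [1, 3, 6] [1, 3, 6]
```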
Architecture Components
JobManager
The cluster’s “brain”, responsible for:
- Receiving and parsing job graphs (JobGraph → ExecutionGraph)
- Task scheduling and resource allocation
- Checkpoint coordination and fault recovery
- Monitoring TaskManager heartbeats
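The JobGraph → ExecutionGraph step can be sketched as expanding each logical operator into its parallel subtasks. The list-based representation and names here are hypothetical simplifications of what the JobManager actually builds:

```python
def to_execution_graph(job_graph):
    """job_graph: list of (operator_name, parallelism) pairs in
    topological order, standing in for the logical JobGraph.

    Returns the flat list of parallel subtasks (the ExecutionGraph's
    vertices) that the JobManager would schedule onto task slots.
    """
    subtasks = []
    for name, parallelism in job_graph:
        for i in range(parallelism):
            # Each logical vertex with parallelism p becomes p subtasks.
            subtasks.append(f"{name}[{i + 1}/{parallelism}]")
    return subtasks

job_graph = [("Source", 2), ("Map", 2), ("Sink", 1)]
print(to_execution_graph(job_graph))
# ['Source[1/2]', 'Source[2/2]', 'Map[1/2]', 'Map[2/2]', 'Sink[1/1]']
```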
TaskManager
The “worker” executing tasks, responsible for:
- Running assigned Tasks (SubTasks)
- Managing local state
- Handling network data transfer between tasks
- Reporting status to JobManager
Dispatcher
The job submission entry point. It exposes a REST API, starts a JobMaster (the per-job JobManager component) for each submitted job, and also serves the Flink Web UI.
ResourceManager
Interfaces with the underlying cluster management system (YARN, Kubernetes, or Standalone) and is responsible for requesting, allocating, and releasing compute resources in units of Task Slots.
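Slot bookkeeping can be sketched as a simple pool. This is a hypothetical simplification: the real ResourceManager also talks to YARN or Kubernetes to provision new TaskManagers when no free slot is available.

```python
class SlotPool:
    """Toy model of ResourceManager slot accounting: slots are
    identified as (task_manager, slot_index) pairs."""

    def __init__(self, slots_per_tm, task_managers):
        self.free = [(tm, i) for tm in task_managers
                     for i in range(slots_per_tm)]
        self.allocated = {}   # slot -> job id

    def request(self, job_id):
        """Hand out a free slot, or None (which, on YARN/Kubernetes,
        would trigger requesting a new TaskManager)."""
        if not self.free:
            return None
        slot = self.free.pop()
        self.allocated[slot] = job_id
        return slot

    def release(self, slot):
        """Return a slot to the pool when its task finishes."""
        del self.allocated[slot]
        self.free.append(slot)

pool = SlotPool(slots_per_tm=2, task_managers=["tm-1", "tm-2"])
s1 = pool.request("job-A")
s2 = pool.request("job-A")
pool.release(s1)
print(len(pool.free), pool.allocated)
```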
State Backend
State storage abstraction layer. The classic configuration offers three implementations (since Flink 1.13 these are superseded by HashMapStateBackend and EmbeddedRocksDBStateBackend, with the same trade-offs):
| Backend | Storage Location | Use Case |
|---|---|---|
| MemoryStateBackend | JVM Heap | Development/testing |
| FsStateBackend | File system + Heap | Medium state scale |
| RocksDBStateBackend | RocksDB (local disk) | Ultra-large state, production recommended |
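Choosing a backend is a configuration change. For example, in `flink-conf.yaml` (these are real configuration keys; the checkpoint path is a placeholder):

```yaml
# Select the RocksDB backend for very large state.
state.backend: rocksdb
# Durable storage for completed checkpoints (placeholder path).
state.checkpoints.dir: hdfs:///flink/checkpoints
# Upload only changed SST files at each checkpoint.
state.backend.incremental: true
```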
Flink vs Spark Streaming
| Dimension | Flink | Spark Streaming |
|---|---|---|
| Processing Model | Native stream processing | Micro-batch |
| Latency | Millisecond-level | Second-level (batch interval) |
| State Management | Native built-in | Depends on RDD checkpoint |
| Event Time | Native support | Supported but complex |
| Batch Processing | Unified API | Separate Spark Core API |
| Learning Curve | Steeper | Unified with Spark ecosystem |
Use Cases
Typical scenarios for choosing Flink:
- Real-time risk control requiring millisecond-level end-to-end latency
- Complex Event Processing (CEP) and multi-stream joins
- Large-scale stateful stream computing (e.g., real-time recommendations, real-time features)
- Financial scenarios requiring strict Exactly-Once semantics
Subsequent articles will progressively introduce Flink DataStream API programming, window operations, state management, and Checkpoint configuration.