This is article 90 in the Big Data series and the starting point of the Flink series. It introduces Apache Flink’s design philosophy, technical features, and architecture components.

Apache Flink is an open-source distributed stream processing framework, specifically designed for high-performance, low-latency processing of both bounded (batch) and unbounded (stream) data.

Flink’s core design philosophy is unified stream-batch processing: batch processing is treated as a special case of stream processing (a bounded, finite stream), not as a separate computing model. This allows the same API and runtime to process both real-time streaming data and offline batch data.

Development History

  • 2008: Stratosphere research project started at TU Berlin
  • April 2014: Donated to the Apache Software Foundation
  • December 2014: Became an Apache top-level project, officially named Flink (German for “quick, agile”)
  • Present: Large-scale production use by Alibaba, Uber, Netflix, and others

Alibaba is one of Flink’s most important enterprise contributors—its internally maintained Blink fork has been merged back into the community mainline.

Core Technical Features

1. Exactly-Once State Consistency

Flink implements fault tolerance with lightweight distributed snapshots (an asynchronous-barrier variant of the Chandy-Lamport algorithm). On node failure, the job rolls back precisely to the state of the last completed Checkpoint, so every record affects state exactly once.
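The barrier idea can be shown with a toy simulation (all class and method names below are illustrative, not Flink API): when a checkpoint barrier flows past an operator, the operator snapshots its state; recovery rolls state back to that snapshot.

```python
# Toy simulation of barrier-based snapshotting, inspired by Flink's
# Chandy-Lamport variant. Illustrative names only, not Flink's API.

class CountingOperator:
    def __init__(self):
        self.count = 0          # running operator state
        self.snapshots = {}     # checkpoint_id -> saved state

    def process(self, record):
        self.count += 1

    def on_barrier(self, checkpoint_id):
        # Barrier received: snapshot the current state for this checkpoint.
        self.snapshots[checkpoint_id] = self.count

    def restore(self, checkpoint_id):
        # Failure recovery: roll state back to the last completed checkpoint.
        self.count = self.snapshots[checkpoint_id]

op = CountingOperator()
stream = ["a", "b", ("barrier", 1), "c", "d", "e"]
for item in stream:
    if isinstance(item, tuple) and item[0] == "barrier":
        op.on_barrier(item[1])
    else:
        op.process(item)

print(op.count)   # 5 records processed so far
op.restore(1)     # simulate a crash after checkpoint 1 completed
print(op.count)   # back to 2, the state captured at checkpoint 1
```

In real Flink the snapshot is written asynchronously to durable storage and coordinated by the JobManager; the key point is that barriers divide the stream into consistent epochs.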

2. Event Time Processing

Flink natively supports event-time semantics, using the Watermark mechanism to handle out-of-order data, so events are processed by the time they were generated rather than the time they arrive at the system.

3. Layered API Architecture

SQL / Table API    ← Declarative, suitable for analysts
    DataStream     ← Core stream processing API
      ProcessFunction ← Lowest level, complete control

4. Memory Management

Flink manages memory itself (TypeSerializer + MemorySegment), serializing records into preallocated byte segments instead of keeping them as JVM objects. This avoids GC pressure and, in large-data scenarios, typically yields much better memory efficiency than object-heavy engines such as Spark.
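The core trick can be sketched without any Flink code: records live as bytes in a fixed-size segment, so the garbage collector never sees millions of small objects. (Toy code; the names and layout are illustrative, not Flink’s MemorySegment API.)

```python
# Records serialized into a flat byte segment instead of individual objects.
import struct

SEGMENT_SIZE = 64
RECORD_FMT = "<qd"                          # one (long, double) record
RECORD_SIZE = struct.calcsize(RECORD_FMT)   # 16 bytes per record

segment = bytearray(SEGMENT_SIZE)           # one fixed-size memory segment
count = 0

def write_record(key, value):
    """Serialize a record into the segment at the next free offset."""
    global count
    struct.pack_into(RECORD_FMT, segment, count * RECORD_SIZE, key, value)
    count += 1

def read_record(i):
    """Deserialize the i-th record directly from the byte segment."""
    return struct.unpack_from(RECORD_FMT, segment, i * RECORD_SIZE)

write_record(42, 3.5)
write_record(7, 1.25)
print(read_record(0))   # (42, 3.5)
print(read_record(1))   # (7, 1.25)
```

Because operations like sorting and hashing can work on these raw bytes, the engine controls memory layout and spilling itself rather than trusting the JVM heap.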

5. Elastic Scalability

Scales horizontally to thousands of nodes, sustaining PB-scale data volumes while keeping processing latency at the millisecond level.

Processing Models: Bounded vs Unbounded

Unbounded Streams:

  • Data continuously generated, no natural endpoint
  • Must be processed in real-time, running 24/7
  • Typical sources: Kafka, message queues, sensor data

Bounded Streams:

  • Have clear start and end
  • Can wait for all data to arrive before processing
  • Equivalent to traditional batch processing, terminates after completion
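The “batch is a bounded stream” idea can be made concrete: the same incremental logic consumes both a finite list and an endless generator. (Illustrative Python; Flink’s actual unification happens inside its runtime.)

```python
# One running-sum function, two kinds of sources: bounded and unbounded.
import itertools

def running_sum(source):
    total = 0
    for x in source:
        total += x
        yield total   # emit an updated result per record, stream-style

# Bounded stream: terminates, and the last emitted value is the batch answer.
batch_result = list(running_sum([1, 2, 3, 4]))
print(batch_result)    # [1, 3, 6, 10]

# Unbounded stream: never terminates; we can only observe results so far.
unbounded = itertools.count(1)   # 1, 2, 3, ... forever
stream_result = list(itertools.islice(running_sum(unbounded), 4))
print(stream_result)   # [1, 3, 6, 10]
```

The processing logic is identical; only the source decides whether the computation ever finishes.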

Architecture Components

JobManager

The cluster’s “brain”, responsible for:

  • Receiving and parsing job graphs (JobGraph → ExecutionGraph)
  • Task scheduling and resource allocation
  • Checkpoint coordination and fault recovery
  • Monitoring TaskManager heartbeats
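The first duty, translating a JobGraph into an ExecutionGraph, is essentially an expansion of each logical operator into parallel subtasks. A toy sketch (names mimic Flink’s Web UI labels, but this is not Flink’s internal code):

```python
# Expanding a logical JobGraph (operator, parallelism) into subtasks.

job_graph = [
    ("Source",  2),   # (operator name, parallelism)
    ("FlatMap", 4),
    ("Sink",    1),
]

def to_execution_graph(job_graph):
    # Each operator becomes `parallelism` subtasks, labeled like Flink's UI.
    return [f"{name} ({i + 1}/{p})"
            for name, p in job_graph
            for i in range(p)]

subtasks = to_execution_graph(job_graph)
print(len(subtasks))   # 7 subtasks to schedule onto TaskSlots
print(subtasks[0])     # Source (1/2)
```

Each of these subtasks is then scheduled into a TaskSlot on some TaskManager.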

TaskManager

The “worker” executing tasks, responsible for:

  • Running assigned Tasks (SubTasks)
  • Managing local state
  • Handling network data transfer between tasks
  • Reporting status to JobManager

Dispatcher

Job submission entry point: it exposes the REST API, forwards submitted jobs to the corresponding JobManager, and also hosts the Flink Web UI.

ResourceManager

Interfaces with the underlying cluster management system (YARN, Kubernetes, Standalone) and is responsible for requesting, allocating, and releasing compute resources (TaskSlots).
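The slot lifecycle can be sketched as simple bookkeeping (illustrative names, not Flink API): slots are requested for a job, handed out while capacity remains, and returned to the pool on release.

```python
# Minimal sketch of ResourceManager-style TaskSlot bookkeeping.

class SlotPool:
    def __init__(self, total_slots):
        self.free = list(range(total_slots))   # available slot ids
        self.allocated = {}                    # slot id -> job id

    def request(self, job_id):
        if not self.free:
            return None   # no capacity: a real RM would ask YARN/K8s for more
        slot = self.free.pop()
        self.allocated[slot] = job_id
        return slot

    def release(self, slot):
        del self.allocated[slot]
        self.free.append(slot)

pool = SlotPool(total_slots=2)
s1 = pool.request("job-a")
s2 = pool.request("job-a")
print(pool.request("job-b"))                # None: pool is exhausted
pool.release(s1)
print(pool.request("job-b") is not None)    # True: the freed slot is reused
```

A real ResourceManager additionally negotiates with the cluster framework to start new TaskManagers when the pool runs dry.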

State Backend

State storage abstraction layer with three implementations:

  Backend              Storage Location      Use Case
  MemoryStateBackend   JVM Heap              Development/testing
  FsStateBackend       File system + Heap    Medium state scale
  RocksDBStateBackend  RocksDB (local disk)  Ultra-large state, production recommended
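What makes this an abstraction layer is that operator code uses the same key/value state API regardless of where the bytes live. A toy sketch with a memory-backed and a file-backed implementation (illustrative only; Flink’s real backends are far more involved):

```python
# Same state API, two storage strategies: in-memory vs. on-disk files.
import json
import os
import tempfile

class MemoryBackend:
    def __init__(self):
        self._store = {}            # fast, but lost on process failure
    def put(self, key, value):
        self._store[key] = value
    def get(self, key):
        return self._store.get(key)

class FileBackend:
    def __init__(self, directory):
        self.directory = directory  # survives a restart; scales past the heap
    def _path(self, key):
        return os.path.join(self.directory, f"{key}.json")
    def put(self, key, value):
        with open(self._path(key), "w") as f:
            json.dump(value, f)
    def get(self, key):
        try:
            with open(self._path(key)) as f:
                return json.load(f)
        except FileNotFoundError:
            return None

# The operator logic is identical whichever backend is configured:
for backend in (MemoryBackend(), FileBackend(tempfile.mkdtemp())):
    backend.put("user-42", {"clicks": 3})
    print(backend.get("user-42"))   # {'clicks': 3}
```

RocksDBStateBackend follows the same pattern at scale: state is kept in an embedded key/value store on local disk, so it can far exceed available heap memory.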
Comparison: Flink vs Spark Streaming

  Dimension          Flink                      Spark Streaming
  Processing Model   Native stream processing   Micro-batch
  Latency            Millisecond-level          Second-level (batch interval)
  State Management   Native built-in            Depends on RDD checkpoint
  Event Time         Native support             Supported but complex
  Batch Processing   Unified API                Separate Spark Core API
  Learning Curve     Steeper                    Gentler for existing Spark users

Use Cases

Typical scenarios for choosing Flink:

  • Real-time risk control requiring millisecond-level end-to-end latency
  • Complex Event Processing (CEP) and multi-stream joins
  • Large-scale stateful stream computing (e.g., real-time recommendations, real-time features)
  • Financial scenarios requiring strict Exactly-Once semantics

Subsequent articles will progressively introduce Flink DataStream API programming, window operations, state management, and Checkpoint configuration.