Apache Flink is an open-source stream processing framework and distributed processing engine, designed for efficient stateful computations over unbounded (streaming) and bounded (batch) data streams. Flink’s core design philosophy includes:

  1. Stream-First Architecture: Adopts “unified stream-batch” architecture, treating batch processing as a special case of stream processing
  2. Distributed Execution: Supports deployment in common cluster environments like YARN, Mesos, Kubernetes
  3. High-Performance Computing: Achieves low-latency, high-throughput data processing based on in-memory execution and optimized network communication
  4. Elastic Scaling: Can scale horizontally to thousands of nodes, processing PB-level data volumes
Development History

  • Origin: Started in 2008 as the Stratosphere research project at TU Berlin, focusing on large-scale data processing technology
  • Incubation: In April 2014, the Stratosphere project was donated to the Apache Software Foundation and entered the incubation phase
  • Maturation: In December 2014, it officially became an Apache top-level project and was renamed Flink (German for “quick” or “agile”)
  • Development: After years of development, it is now one of the most active Apache projects, adopted by many well-known companies such as Alibaba, Uber, and Netflix
Core Features

  1. Exactly-once state consistency
  2. Event-time processing and Watermark mechanism
  3. Layered APIs: Including SQL/Table API, DataStream API, and ProcessFunction API
  4. Fault Tolerance: Based on lightweight distributed snapshots (Checkpoint) for fault recovery
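
The event-time and watermark mechanism (item 2 above) can be illustrated with a minimal, framework-free sketch. This is plain Python, not the Flink API: a bounded-out-of-orderness watermark is simply the highest event timestamp seen so far minus an allowed lateness, so events at or below the watermark are assumed to have arrived.

```python
class BoundedOutOfOrdernessWatermark:
    """Conceptual sketch of a bounded-out-of-orderness watermark:
    watermark = max event timestamp seen so far - allowed lateness.
    (Illustrative only; not Flink's WatermarkStrategy API.)"""

    def __init__(self, max_out_of_orderness_ms: int):
        self.max_out_of_orderness = max_out_of_orderness_ms
        self.max_timestamp = float("-inf")

    def on_event(self, event_timestamp_ms: int) -> None:
        # Track the highest event timestamp observed on the stream.
        self.max_timestamp = max(self.max_timestamp, event_timestamp_ms)

    def current_watermark(self):
        # Events with timestamp <= watermark are assumed to have arrived.
        return self.max_timestamp - self.max_out_of_orderness

wm = BoundedOutOfOrdernessWatermark(max_out_of_orderness_ms=2000)
for ts in [1000, 3000, 2500, 7000]:   # out-of-order event timestamps
    wm.on_event(ts)
print(wm.current_watermark())  # 7000 - 2000 = 5000
```

The watermark lags the fastest-moving timestamp by a fixed bound, which is how event-time windows decide when it is safe to fire.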

Flink has a rich ecosystem:

  • Connectors: Supports Kafka, HDFS, JDBC and many other data sources
  • State Backends: Provides memory, file system, RocksDB storage options
  • Deployment Modes: Supports Standalone, YARN, Kubernetes and other deployment methods

Flink is an open-source stream processing framework with these characteristics:

  • Unified batch and stream: A single engine handles both batch and stream processing
  • Distributed: Flink programs can run across multiple servers
  • High performance: Low latency and high throughput
  • High availability: Flink supports high-availability (HA) setups
  • Accuracy: Flink can guarantee the accuracy of data processing results

Flink is mainly used for streaming data analysis scenarios. Data is everywhere, and most enterprise data processing workloads fall into two categories:

  • Transaction processing
  • Analytical processing

Transaction Processing

  • OLTP: On-Line Transaction Processing
  • Handles approvals, data entry, form filling, etc.
  • Characteristics: Formerly offline workflows moved online; data is stored in each system separately and not shared between systems (data silos)

An OLTP (online transaction processing) system takes individual transactions as its unit of work and supports interactive, human-facing applications. It updates data immediately, so the data in the system always reflects the latest state.

Main applications include:

  • Airline booking
  • Stock trading
  • Supermarket sales
  • Restaurant front/back office management, etc.

Common examples: ERP, CRM, and OA systems are all OLTP systems.

Analytical Processing

When data has accumulated to a certain extent, we need to summarize and analyze what happened in the past: take the data generated over a period of time, run statistical analysis over it, and use the resulting information to support company decision-making. This is OLAP.

OLAP (On-Line Analytical Processing) is used for:

  • Analytical reports
  • Decision support
  • And similar workloads

ETL (Extract-Transform-Load): Extracts data from transactional databases, transforms it into a common representation (which may include validation, normalization, encoding, deduplication, schema transformation, etc.), and finally loads it into an analytical database.
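
A minimal ETL sketch in plain Python may make the three stages concrete. The record shape and field names here are hypothetical, not tied to any specific connector or schema:

```python
# Minimal, framework-free ETL sketch. The "orders" record shape is invented
# for illustration only.

raw_orders = [
    {"id": "1", "amount": " 19.90 ", "country": "de"},
    {"id": "1", "amount": " 19.90 ", "country": "de"},  # duplicate row
    {"id": "2", "amount": "5.00",    "country": "US"},
]

def extract(records):
    # Stand-in for reading from an OLTP source (JDBC, CDC, log files, ...).
    yield from records

def transform(records):
    seen = set()
    for r in records:
        if r["id"] in seen:          # deduplication
            continue
        seen.add(r["id"])
        yield {                      # validation / normalization / typing
            "id": int(r["id"]),
            "amount": float(r["amount"].strip()),
            "country": r["country"].upper(),
        }

def load(records):
    # Stand-in for writing into an analytical store (data warehouse).
    return list(records)

warehouse = load(transform(extract(raw_orders)))
print(warehouse)
```

Each stage is a generator, so records flow through one at a time; a streaming ETL job applies the same idea continuously instead of over a finite batch.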

Usually data warehouse queries can be divided into two types:

  • Regular queries: Predefined, recurring reports
  • Ad-hoc queries: Queries whose conditions are defined by the user on the fly

Typical application scenarios:

  • Real-time ETL: Combines existing data channels with the flexibility of SQL to clean, merge, and structure streaming data in real time
  • Real-time reports: Collects and processes streaming data in real time, then monitors and displays business and customer metrics as they change
  • Monitoring and alerting: Monitors and analyzes system and user behavior in real time to detect dangerous behavior promptly
  • Online systems: Computes various metrics in real time and uses the results to promptly adjust the strategies of online systems

Deploy Layer

  • Local mode: Flink runs in a single JVM
  • Standalone cluster mode, as well as Flink on YARN, where the Flink application is submitted directly to YARN
  • Cloud environments such as Google Cloud and Amazon Web Services

Core Layer

Provides two core APIs above Runtime:

  • DataStream API (stream processing)
  • DataSet API (batch processing; deprecated in recent Flink versions in favor of unified batch execution on the DataStream API)
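
Both APIs share the same transformation vocabulary (flatMap, keyBy/groupBy, aggregations). As a framework-free analogy, the classic keyed word count can be sketched with plain Python generators standing in for the operators (this is illustrative, not Flink code):

```python
from collections import defaultdict

# Framework-free analogy of a keyed streaming word count.
# In Flink this would be roughly flatMap -> keyBy -> sum;
# here plain generators stand in for the operators.

def flat_map(lines):
    # One input line fans out into many words (flatMap).
    for line in lines:
        yield from line.lower().split()

def key_by_and_sum(words):
    # Per-key state, updated incrementally as each element arrives;
    # a streaming job emits an updated count per element rather than
    # waiting for the input to end.
    counts = defaultdict(int)
    for w in words:
        counts[w] += 1
        yield w, counts[w]

lines = ["to be or not to be"]
result = dict(key_by_and_sum(flat_map(lines)))  # keep the latest count per key
print(result)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

The per-key running state is exactly what Flink's state backends manage and checkpoint for real jobs.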

APIs & Libraries Layer

Higher-level libraries and APIs built on top of the core APIs:

  • CEP (complex event processing)
  • Table API and SQL
  • Flink ML machine learning library
  • Gelly graph processing
  • Input connectors:
    1. Stream processing: Kafka (message queue), AWS Kinesis (real-time data stream service), RabbitMQ (message queue), NiFi (data pipeline), Cassandra (NoSQL database), Elasticsearch (full-text search), HDFS (rolling files)
    2. Batch processing: HBase (distributed columnar database), HDFS (distributed file system)

Flink supports both stream and batch processing, with a focus on unbounded streams; bounded stream processing is treated as a special case of unbounded stream processing.

Unbounded Stream Processing

  • Input data has no end, like a continuously flowing stream of water
  • Processing starts from a current or past point in time and continues without interruption

Bounded Stream Processing

  • Processes data starting from a certain time point, then ends at another time point
  • Input data may be inherently bounded (i.e., input dataset doesn’t grow with time), or artificially set as bounded for analysis purposes
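
The bounded/unbounded distinction maps neatly onto finite versus infinite sequences. A conceptual sketch in plain Python (no Flink): over a bounded input you can compute a final answer, while over an unbounded input the computation must be incremental, emitting a result per element:

```python
import itertools

def unbounded_source():
    # e.g. sensor readings arriving forever (never terminates on its own)
    yield from itertools.count()

def running_sum(stream):
    # Incremental computation: emit an updated result per element
    # instead of waiting for the (nonexistent) end of the stream.
    total = 0
    for x in stream:
        total += x
        yield total

bounded = range(5)            # bounded: a final answer exists
print(sum(bounded))           # 10

# Unbounded: we can only ever observe a prefix of the results.
first_results = list(itertools.islice(running_sum(unbounded_source()), 5))
print(first_results)          # [0, 1, 3, 6, 10]
```

This is why stream processors are built around incremental state updates and windows rather than whole-dataset operations.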

Flink provides the DataStream API for stream processing and the DataSet API for batch processing. As a unified stream-batch engine, Flink also provides the Table API and SQL for unified batch and stream processing.

Stream Processing Engine Technology Selection

There are many stream processing engines on the market besides Flink, such as Storm, Spark Streaming, and Trident. How to choose in practice:

  • If the streaming job needs state management, choose Trident, Spark Streaming, or Flink
  • If message delivery must guarantee at-least-once or exactly-once semantics, Storm is not an option
  • For small, independent projects with low latency requirements, Storm can be the simpler choice
  • If the project already uses Spark and Spark Streaming meets the real-time processing requirements, it is reasonable to use Spark Streaming directly
  • If message delivery needs exactly-once semantics, with large data volumes, high throughput, and low latency requirements, plus state management or window statistics, Flink is recommended

Architecture Components

JobManager

JobManager is the core control component of a Flink cluster, responsible for managing the entire lifecycle of a data processing job. Its main responsibilities include:

  • Task Scheduling: JobManager divides user-submitted jobs into multiple tasks and schedules these tasks to different TaskManagers for execution
  • Resource Management: It interacts with resource management systems (like YARN or Kubernetes) to allocate and manage resources required for job execution
  • Fault Recovery: When a task in the job fails, JobManager is responsible for rescheduling that task and recovering execution from failure point
  • Checkpointing Coordination: JobManager coordinates Flink’s fault tolerance mechanism, ensuring job state consistency through managing Checkpointing

TaskManager

TaskManager is a worker node in the Flink cluster, responsible for executing the specific tasks assigned by the JobManager. Its responsibilities include:

  • Task Execution: TaskManager accepts tasks assigned by JobManager and executes them. Each TaskManager can execute multiple task instances simultaneously
  • State Management: In stateful stream processing applications, TaskManager is responsible for managing task’s local state
  • Data Transfer: TaskManager is responsible for transferring data between different tasks

Dispatcher

Dispatcher is a relatively new component. Its main responsibility is to accept jobs submitted by clients and hand them to a JobManager in the cluster for execution. The Dispatcher also serves the Flink cluster’s REST API.

ResourceManager

ResourceManager is responsible for interacting with cluster managers (such as YARN, Kubernetes, or Standalone) and for managing the resources required by Flink jobs.

Client

The client is the entry point for user interaction with a Flink cluster; users submit jobs to the Dispatcher through the client.

Flink Runtime is where Flink’s core data processing engine resides. It is responsible for processing data streams and executing user-defined operations.

State Backend

State Backend is the Flink module used for storing task state. Commonly used state backends include:

  • Memory state backend: Stores state on the TaskManager heap, suitable for small-scale jobs
  • File system state backend: Keeps working state in memory and writes checkpoints to a file system such as HDFS
  • RocksDB state backend: Stores state in an embedded RocksDB database, suitable for large-scale, stateful stream processing applications
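
The backend is typically selected in the cluster configuration. A sketch of the relevant flink-conf.yaml entries (key names vary across Flink versions; verify the paths and keys against the documentation for your release):

```yaml
# flink-conf.yaml sketch -- key names and paths are illustrative;
# check your Flink version's configuration reference.
state.backend: rocksdb                            # or an in-memory/heap backend
state.checkpoints.dir: hdfs:///flink/checkpoints  # durable checkpoint storage
state.savepoints.dir: hdfs:///flink/savepoints    # default savepoint location
```

Keeping checkpoints on durable distributed storage (rather than local disk) is what allows recovery after a whole machine is lost.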

Checkpointing and Savepoints

Flink provides two mechanisms for fault tolerance:

  • Checkpointing: Periodically saves task state to distributed storage so that, when a failure occurs, the job can recover from the latest checkpoint
  • Savepoints: Manually triggered state snapshots, used for program upgrades or redeployment
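
The recovery idea behind both mechanisms can be sketched without Flink: periodically snapshot operator state so that, after a failure, processing resumes from the latest snapshot instead of from the beginning. A conceptual sketch in plain Python:

```python
import copy

class CountingOperator:
    """Toy stateful operator: its state is just an event counter."""

    def __init__(self):
        self.count = 0              # operator state

    def process(self, _event):
        self.count += 1

    def checkpoint(self):
        # Snapshot the state (deep copy stands in for durable storage).
        return copy.deepcopy({"count": self.count})

    def restore(self, snapshot):
        self.count = snapshot["count"]

op = CountingOperator()
snapshots = []
for i, event in enumerate(range(10)):
    op.process(event)
    if (i + 1) % 4 == 0:            # "periodic" checkpoint every 4 events
        snapshots.append(op.checkpoint())

# Simulated failure: a fresh operator restores from the latest checkpoint.
op_after_failure = CountingOperator()
op_after_failure.restore(snapshots[-1])
print(op_after_failure.count)       # 8: only events 9 and 10 need replaying
```

Real Flink checkpoints additionally coordinate consistent snapshots across all parallel operators and record source positions, so the replay is exactly the unprocessed suffix of the stream.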

Data Stream and Data Set API

  • DataStream API: Used for stream processing, supports unbounded and bounded data streams
  • DataSet API: Used for batch processing of bounded datasets (deprecated in recent Flink versions)

Execution Graph

When a Flink job is submitted, it is transformed into an execution graph, which describes the tasks in the job and the dependencies between them.