Apache Flink is an open-source stream processing framework and distributed processing engine, designed for efficient stateful computations over unbounded (streaming) and bounded (batch) data streams. Flink’s core design philosophy includes:

  1. Stream-First Architecture: Adopts “unified stream-batch” architecture, treating batch processing as a special case of stream processing
  2. Distributed Execution: Supports deployment in common cluster environments like YARN, Mesos, Kubernetes
  3. High-Performance Computing: Achieves low-latency, high-throughput data processing based on in-memory execution and optimized network communication
  4. Elastic Scaling: Can scale horizontally to thousands of nodes, processing PB-level data volumes
Development History

  • Origin: Started in 2008 as the Stratosphere research project at TU Berlin, focusing on large-scale data processing technology
  • Incubation: In April 2014, the Stratosphere project was donated to the Apache Software Foundation and entered the incubation phase
  • Maturation: In December 2014, it officially became an Apache top-level project and was renamed Flink (German for “quick” or “agile”)
  • Development: After years of development, it is now one of the most active Apache projects, adopted by many well-known companies such as Alibaba, Uber, and Netflix
Core Features

  1. Exactly-once state consistency
  2. Event-time processing and Watermark mechanism
  3. Layered APIs: Including SQL/Table API, DataStream API, and ProcessFunction API
  4. Fault Tolerance: Based on lightweight distributed snapshots (Checkpoint) for fault recovery
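
The event-time and watermark mechanism (item 2 above) can be illustrated with a minimal, framework-free sketch. This is plain Python, not the Flink API: a bounded-out-of-orderness watermark is simply the highest event timestamp seen so far minus an allowed lateness, so events at or below the watermark are assumed to have arrived.

```python
class BoundedOutOfOrdernessWatermark:
    """Conceptual sketch of a bounded-out-of-orderness watermark:
    watermark = max event timestamp seen so far - allowed lateness.
    (Illustrative only; not Flink's WatermarkStrategy API.)"""

    def __init__(self, max_out_of_orderness_ms: int):
        self.max_out_of_orderness = max_out_of_orderness_ms
        self.max_timestamp = float("-inf")

    def on_event(self, event_timestamp_ms: int) -> None:
        # Track the highest event timestamp observed on the stream.
        self.max_timestamp = max(self.max_timestamp, event_timestamp_ms)

    def current_watermark(self):
        # Events with timestamp <= watermark are assumed to have arrived.
        return self.max_timestamp - self.max_out_of_orderness

wm = BoundedOutOfOrdernessWatermark(max_out_of_orderness_ms=2000)
for ts in [1000, 3000, 2500, 7000]:   # out-of-order event timestamps
    wm.on_event(ts)
print(wm.current_watermark())  # 7000 - 2000 = 5000
```

The watermark lags the fastest-moving timestamp by a fixed bound, which is how event-time windows decide when it is safe to fire.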

Flink has a rich ecosystem:

  • Connectors: Supports Kafka, HDFS, JDBC and many other data sources
  • State Backends: Provides memory, file system, RocksDB storage options
  • Deployment Modes: Supports Standalone, YARN, Kubernetes and other deployment methods

Flink is an open-source stream processing framework with these characteristics:

  • Unified batch and stream: A single engine handles both batch and stream processing
  • Distributed: Flink programs can run across multiple servers
  • High performance: Low latency and high throughput
  • High availability: Flink supports high-availability (HA) setups
  • Accuracy: Flink can guarantee the accuracy of data processing results

Flink is mainly used for streaming data analysis scenarios. Data is everywhere, and most enterprise data processing workloads fall into two categories:

  • Transaction processing
  • Analytical processing

Transaction Processing

  • OLTP: On-Line Transaction Processing
  • Handles approvals, data entry, form filling, etc.
  • Characteristics: Formerly offline workflows moved online; data is stored in each system separately and not shared between systems (data silos)

An OLTP (online transaction processing) system takes individual transactions as its unit of work and supports interactive, human-facing applications. It updates data immediately, so the data in the system always reflects the latest state.

Main applications include:

  • Airline booking
  • Stock trading
  • Supermarket sales
  • Restaurant front/back office management, etc.

Common examples: ERP, CRM, and OA systems are all OLTP systems.

Analytical Processing

When data has accumulated to a certain extent, we need to summarize and analyze what happened in the past: take the data generated over a period of time, run statistical analysis over it, and use the resulting information to support company decision-making. This is OLAP.

OLAP (On-Line Analytical Processing) is used for:

  • Analytical reports
  • Decision support
  • And similar workloads

ETL (Extract-Transform-Load): Extracts data from transactional databases, transforms it into a common representation (which may include validation, normalization, encoding, deduplication, schema transformation, etc.), and finally loads it into an analytical database.
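
A minimal ETL sketch in plain Python may make the three stages concrete. The record shape and field names here are hypothetical, not tied to any specific connector or schema:

```python
# Minimal, framework-free ETL sketch. The "orders" record shape is invented
# for illustration only.

raw_orders = [
    {"id": "1", "amount": " 19.90 ", "country": "de"},
    {"id": "1", "amount": " 19.90 ", "country": "de"},  # duplicate row
    {"id": "2", "amount": "5.00",    "country": "US"},
]

def extract(records):
    # Stand-in for reading from an OLTP source (JDBC, CDC, log files, ...).
    yield from records

def transform(records):
    seen = set()
    for r in records:
        if r["id"] in seen:          # deduplication
            continue
        seen.add(r["id"])
        yield {                      # validation / normalization / typing
            "id": int(r["id"]),
            "amount": float(r["amount"].strip()),
            "country": r["country"].upper(),
        }

def load(records):
    # Stand-in for writing into an analytical store (data warehouse).
    return list(records)

warehouse = load(transform(extract(raw_orders)))
print(warehouse)
```

Each stage is a generator, so records flow through one at a time; a streaming ETL job applies the same idea continuously instead of over a finite batch.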

Usually data warehouse queries can be divided into two types:

  • Regular queries: Predefined, recurring reports
  • Ad-hoc queries: Queries whose conditions are defined by the user on the fly

Typical application scenarios:

  • Real-time ETL: Combines existing data channels with the flexibility of SQL to clean, merge, and structure streaming data in real time
  • Real-time reports: Collects and processes streaming data in real time, then monitors and displays business and customer metrics as they change
  • Monitoring and alerting: Monitors and analyzes system and user behavior in real time to detect dangerous behavior promptly
  • Online systems: Computes various metrics in real time and uses the results to promptly adjust the strategies of online systems

Deploy Layer

  • Local mode: Flink runs in a single JVM
  • Standalone cluster mode, as well as Flink on YARN, where the Flink application is submitted directly to YARN
  • Cloud environments such as Google Cloud and Amazon Web Services

Core Layer

Provides two core APIs above Runtime:

  • DataStream API (stream processing)
  • DataSet API (batch processing; deprecated in recent Flink versions in favor of unified batch execution on the DataStream API)
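
Both APIs share the same transformation vocabulary (flatMap, keyBy/groupBy, aggregations). As a framework-free analogy, the classic keyed word count can be sketched with plain Python generators standing in for the operators (this is illustrative, not Flink code):

```python
from collections import defaultdict

# Framework-free analogy of a keyed streaming word count.
# In Flink this would be roughly flatMap -> keyBy -> sum;
# here plain generators stand in for the operators.

def flat_map(lines):
    # One input line fans out into many words (flatMap).
    for line in lines:
        yield from line.lower().split()

def key_by_and_sum(words):
    # Per-key state, updated incrementally as each element arrives;
    # a streaming job emits an updated count per element rather than
    # waiting for the input to end.
    counts = defaultdict(int)
    for w in words:
        counts[w] += 1
        yield w, counts[w]

lines = ["to be or not to be"]
result = dict(key_by_and_sum(flat_map(lines)))  # keep the latest count per key
print(result)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

The per-key running state is exactly what Flink's state backends manage and checkpoint for real jobs.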

APIs & Libraries Layer

Higher-level libraries and APIs built on top of the core APIs:

  • CEP (complex event processing)
  • Table API and SQL
  • Flink ML machine learning library
  • Gelly graph processing
  • Input connectors:
    1. Stream processing: Kafka (message queue), AWS Kinesis (real-time data stream service), RabbitMQ (message queue), NiFi (data pipeline), Cassandra (NoSQL database), Elasticsearch (full-text search), HDFS (rolling files)
    2. Batch processing: HBase (distributed columnar database), HDFS (distributed file system)

Flink supports both stream and batch processing, with a focus on unbounded streams; bounded stream processing is treated as a special case of unbounded stream processing.

Unbounded Stream Processing

  • Input data has no end, like a continuously flowing stream of water
  • Processing starts from a current or past point in time and continues without interruption

Bounded Stream Processing

  • Processes data starting from a certain time point, then ends at another time point
  • Input data may be inherently bounded (i.e., input dataset doesn’t grow with time), or artificially set as bounded for analysis purposes
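
The bounded/unbounded distinction maps neatly onto finite versus infinite sequences. A conceptual sketch in plain Python (no Flink): over a bounded input you can compute a final answer, while over an unbounded input the computation must be incremental, emitting a result per element:

```python
import itertools

def unbounded_source():
    # e.g. sensor readings arriving forever (never terminates on its own)
    yield from itertools.count()

def running_sum(stream):
    # Incremental computation: emit an updated result per element
    # instead of waiting for the (nonexistent) end of the stream.
    total = 0
    for x in stream:
        total += x
        yield total

bounded = range(5)            # bounded: a final answer exists
print(sum(bounded))           # 10

# Unbounded: we can only ever observe a prefix of the results.
first_results = list(itertools.islice(running_sum(unbounded_source()), 5))
print(first_results)          # [0, 1, 3, 6, 10]
```

This is why stream processors are built around incremental state updates and windows rather than whole-dataset operations.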

Flink provides the DataStream API for stream processing and the DataSet API for batch processing. As a unified stream-batch engine, Flink also provides the Table API and SQL for unified batch and stream processing.

Stream Processing Engine Technology Selection

There are many stream processing engines on the market besides Flink, such as Storm, Spark Streaming, and Trident. How to choose in practice:

  • If the streaming job needs state management, choose Trident, Spark Streaming, or Flink
  • If message delivery must guarantee at-least-once or exactly-once semantics, Storm is not an option
  • For small, independent projects with low latency requirements, Storm can be the simpler choice
  • If the project already uses Spark and Spark Streaming meets the real-time processing requirements, it is reasonable to use Spark Streaming directly
  • If message delivery needs exactly-once semantics, with large data volumes, high throughput, and low latency requirements, plus state management or window statistics, Flink is recommended

Architecture Components

JobManager

JobManager is the core control component of a Flink cluster, responsible for managing the entire lifecycle of a data processing job. Its main responsibilities include:

  • Task Scheduling: JobManager divides user-submitted jobs into multiple tasks and schedules these tasks to different TaskManagers for execution
  • Resource Management: It interacts with resource management systems (like YARN or Kubernetes) to allocate and manage resources required for job execution
  • Fault Recovery: When a task in the job fails, JobManager is responsible for rescheduling that task and recovering execution from failure point
  • Checkpointing Coordination: JobManager coordinates Flink’s fault tolerance mechanism, ensuring job state consistency through managing Checkpointing

TaskManager

TaskManager is a worker node in the Flink cluster, responsible for executing the specific tasks assigned by the JobManager. Its responsibilities include:

  • Task Execution: TaskManager accepts tasks assigned by JobManager and executes them. Each TaskManager can execute multiple task instances simultaneously
  • State Management: In stateful stream processing applications, TaskManager is responsible for managing task’s local state
  • Data Transfer: TaskManager is responsible for transferring data between different tasks

Dispatcher

Dispatcher is a relatively new component. Its main responsibility is to accept jobs submitted by clients and hand them to a JobManager in the cluster for execution. The Dispatcher also serves the Flink cluster’s REST API.

ResourceManager

ResourceManager is responsible for interacting with cluster managers (such as YARN, Kubernetes, or Standalone) and for managing the resources required by Flink jobs.

Client

The client is the entry point for user interaction with a Flink cluster; users submit jobs to the Dispatcher through the client.

Flink Runtime is where Flink’s core data processing engine resides. It is responsible for processing data streams and executing user-defined operations.

State Backend

State Backend is the Flink module used for storing task state. Commonly used state backends include:

  • Memory state backend: Stores state on the TaskManager heap, suitable for small-scale jobs
  • File system state backend: Keeps working state in memory and writes checkpoints to a file system such as HDFS
  • RocksDB state backend: Stores state in an embedded RocksDB database, suitable for large-scale, stateful stream processing applications
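
The backend is typically selected in the cluster configuration. A sketch of the relevant flink-conf.yaml entries (key names vary across Flink versions; verify the paths and keys against the documentation for your release):

```yaml
# flink-conf.yaml sketch -- key names and paths are illustrative;
# check your Flink version's configuration reference.
state.backend: rocksdb                            # or an in-memory/heap backend
state.checkpoints.dir: hdfs:///flink/checkpoints  # durable checkpoint storage
state.savepoints.dir: hdfs:///flink/savepoints    # default savepoint location
```

Keeping checkpoints on durable distributed storage (rather than local disk) is what allows recovery after a whole machine is lost.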

Checkpointing and Savepoints

Flink provides two mechanisms for fault tolerance:

  • Checkpointing: Periodically saves task state to distributed storage so that, when a failure occurs, the job can recover from the latest checkpoint
  • Savepoints: Manually triggered state snapshots, used for program upgrades or redeployment
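
The recovery idea behind both mechanisms can be sketched without Flink: periodically snapshot operator state so that, after a failure, processing resumes from the latest snapshot instead of from the beginning. A conceptual sketch in plain Python:

```python
import copy

class CountingOperator:
    """Toy stateful operator: its state is just an event counter."""

    def __init__(self):
        self.count = 0              # operator state

    def process(self, _event):
        self.count += 1

    def checkpoint(self):
        # Snapshot the state (deep copy stands in for durable storage).
        return copy.deepcopy({"count": self.count})

    def restore(self, snapshot):
        self.count = snapshot["count"]

op = CountingOperator()
snapshots = []
for i, event in enumerate(range(10)):
    op.process(event)
    if (i + 1) % 4 == 0:            # "periodic" checkpoint every 4 events
        snapshots.append(op.checkpoint())

# Simulated failure: a fresh operator restores from the latest checkpoint.
op_after_failure = CountingOperator()
op_after_failure.restore(snapshots[-1])
print(op_after_failure.count)       # 8: only events 9 and 10 need replaying
```

Real Flink checkpoints additionally coordinate consistent snapshots across all parallel operators and record source positions, so the replay is exactly the unprocessed suffix of the stream.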

Data Stream and Data Set API

  • DataStream API: Used for stream processing, supports unbounded and bounded data streams
  • DataSet API: Used for batch processing of bounded datasets (deprecated in recent Flink versions)

Execution Graph

When a Flink job is submitted, it is transformed into an execution graph, which describes the tasks in the job and the dependencies between them.