What is Flink
Apache Flink is an open-source stream processing framework and distributed processing engine, specifically used for efficient stateful computation on unbounded (streaming) and bounded (batch) data streams. Flink’s core design philosophy includes:
- Stream-First Architecture: Adopts “unified stream-batch” architecture, treating batch processing as a special case of stream processing
- Distributed Execution: Supports deployment in common cluster environments like YARN, Mesos, Kubernetes
- High-Performance Computing: Achieves low-latency, high-throughput data processing based on in-memory execution and optimized network communication
- Elastic Scaling: Can scale horizontally to thousands of nodes, processing PB-level data volumes
Flink Development History
- Origin: Started in 2008 as Stratosphere research project at TU Berlin, focusing on large-scale data processing technology
- Incubation: April 2014, Stratosphere project donated to Apache Software Foundation, entered incubation phase
- Maturation: December 2014, officially became Apache top-level project, renamed Flink (German meaning “quick” and “agile”)
- Development: After years of development, now one of the most active Apache projects, adopted by many well-known companies like Alibaba, Uber, Netflix
Flink Technical Features
- Exactly-once state consistency
- Event-time processing and Watermark mechanism
- Layered APIs: Including SQL/Table API, DataStream API, and ProcessFunction API
- Fault Tolerance: Based on lightweight distributed snapshots (Checkpoint) for fault recovery
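The watermark mechanism above can be illustrated with a small conceptual sketch. This is not the actual Flink API (in Flink you would use a `WatermarkStrategy`); it only mimics the bounded-out-of-orderness idea: the watermark trails the highest event time seen so far by a fixed allowed delay.

```python
# Conceptual sketch of a bounded-out-of-orderness watermark generator.
# Illustrative only -- not the actual Flink API.

class BoundedOutOfOrdernessWatermarks:
    def __init__(self, max_out_of_orderness_ms):
        self.max_delay = max_out_of_orderness_ms
        self.max_timestamp = float("-inf")

    def on_event(self, event_timestamp_ms):
        # Track the highest event time observed so far.
        self.max_timestamp = max(self.max_timestamp, event_timestamp_ms)

    def current_watermark(self):
        # The watermark asserts: "no more events with timestamp <= watermark
        # are expected", trailing the max timestamp by the allowed lateness.
        return self.max_timestamp - self.max_delay

wm = BoundedOutOfOrdernessWatermarks(max_out_of_orderness_ms=2000)
for ts in [1000, 3000, 2500, 6000]:   # out-of-order event timestamps
    wm.on_event(ts)
print(wm.current_watermark())          # 6000 - 2000 = 4000
```

A window ending at event time 4000 could safely fire once this watermark is emitted, because later events with smaller timestamps are (by assumption) no longer expected.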
Flink Ecosystem
Flink has a rich ecosystem:
- Connectors: Supports Kafka, HDFS, JDBC and many other data sources
- State Backends: Provides memory, file system, RocksDB storage options
- Deployment Modes: Supports Standalone, YARN, Kubernetes and other deployment methods
Flink Characteristics
Flink is an open-source framework that unifies batch and stream processing, with these characteristics:
- Unified batch and stream: a single engine handles both batch processing and stream processing
- Distributed: Flink programs can run on multiple servers
- High performance: Relatively high processing performance
- High availability: Flink supports high availability (HA)
- Accuracy: Flink can guarantee data processing accuracy
Flink Scenarios
Flink is mainly used for streaming data analysis scenarios. Data is everywhere, and most enterprise data processing frameworks are divided into two categories:
- Transaction processing
- Analytical processing
Transaction Processing
- OLTP: On-Line Transaction Processing
- Process approval, data entry, filling, etc.
- Characteristics: formerly offline workflows are moved online; data is kept in each system separately, and the systems are not connected to each other (data silos)
An OLTP (online transaction processing) system takes the transaction as its unit of work and supports interactive, human-facing applications. It can update data immediately, so the data in the system is always kept in its latest state.
Main applications in:
- Airline booking
- Stock trading
- Supermarket sales
- Restaurant front/back office management, etc.
Common examples: ERP, CRM, and OA systems are all OLTP systems.
Analytical Processing
When data accumulates to a certain extent, we need to summarize and analyze what happened in the past: take the data generated over a period of time, run statistical analysis on it, and extract the information we want to support company decision-making. This is OLAP.
OLAP (On-Line Analytical Processing) covers:
- Analytical reports
- Decision support
- And similar workloads
ETL (Extract-Transform-Load): extracts data from transactional systems, transforms it into a common representation (which may include data validation, normalization, encoding, deduplication, and schema transformation), and finally loads it into an analytical database.
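The ETL steps above can be sketched as a tiny pipeline. All names here are invented for illustration; a real ETL job would read from and write to actual systems rather than in-memory lists.

```python
# Hypothetical mini-ETL sketch: extract rows from a "transactional" source,
# transform them (normalize + deduplicate), and load into an analytical store.

raw_orders = [  # extract: rows as they come from the OLTP system
    {"id": 1, "email": "Alice@Example.com ", "amount": "19.90"},
    {"id": 2, "email": "bob@example.com",    "amount": "5.00"},
    {"id": 1, "email": "alice@example.com",  "amount": "19.90"},  # duplicate id
]

def transform(rows):
    seen, out = set(), []
    for row in rows:
        if row["id"] in seen:              # deduplication
            continue
        seen.add(row["id"])
        out.append({
            "id": row["id"],
            "email": row["email"].strip().lower(),  # normalization
            "amount": float(row["amount"]),         # type conversion
        })
    return out

warehouse = transform(raw_orders)  # load: here, just an in-memory list
print(len(warehouse))              # 2 (the duplicate id was dropped)
```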
Usually data warehouse queries can be divided into two types:
- Regular queries: predefined, recurring reports
- Ad-hoc queries: queries with user-defined conditions
Typical application scenarios:
- Real-time ETL: combines existing data channels with the flexible processing power of SQL to clean, merge, and structure streaming data in real time
- Real-time reports: collects and processes streaming data in real time, and monitors and displays business and customer metrics in real time
- Monitoring alerts: monitors and analyzes system and user behavior in real time to detect risky behavior promptly
- Online systems: computes various data metrics in real time and uses the results to adjust the strategies of online systems in a timely manner
Flink Core Components
Deploy Layer
- Can start single JVM to run Flink in Local mode
- Flink can also run in Standalone cluster mode, and supports Flink on YARN, where the Flink application is submitted directly to YARN
- Flink can also run on Google Cloud and Amazon Web Services (AWS)
Core Layer
Provides two core APIs above Runtime:
- DataStream API (stream processing)
- DataSet API (batch processing)
APIs & Libraries Layer
Extends some high-level libraries and APIs on core APIs:
- CEP stream processing
- Table API and SQL
- Flink ML machine learning library
- Gelly graph computing
Flink Ecosystem Development
- Connectors:
  - Stream processing: Kafka (message queue), AWS Kinesis (real-time data stream service), RabbitMQ (message queue), NiFi (data pipeline), Cassandra (NoSQL database), Elasticsearch (full-text search), HDFS (rolling files)
  - Batch processing: HBase (distributed columnar database), HDFS (distributed file system)
Flink Processing Model
Flink supports both stream processing and batch processing; it focuses on unbounded stream processing and treats bounded stream processing as a special case of it.
Unbounded Stream Processing
- Input data has no end, like water flow continuously
- Data processing starts from current or past time point, continues uninterrupted
Bounded Stream Processing
- Processes data starting from a certain time point, then ends at another time point
- Input data may be inherently bounded (i.e., input dataset doesn’t grow with time), or artificially set as bounded for analysis purposes
Flink provides the DataStream API for stream processing and the DataSet API for batch processing. At the same time, Flink is a unified stream-batch engine, offering the Table API/SQL for processing batch and stream data uniformly.
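The "bounded is a special case of unbounded" idea can be illustrated with a small sketch (plain Python generators, not the Flink API): the same streaming logic runs over a finite collection or a never-ending stream, because a bounded stream is simply a stream that happens to end.

```python
# Sketch: one piece of streaming logic serves both bounded and unbounded input.
import itertools

def running_sum(stream):
    # Emit the running total after each incoming element.
    total = 0
    for x in stream:
        total += x
        yield total

bounded = [1, 2, 3, 4]                      # batch input = bounded stream
print(list(running_sum(bounded)))           # [1, 3, 6, 10]

unbounded = itertools.count(1)              # conceptually never-ending stream
first_four = list(itertools.islice(running_sum(unbounded), 4))
print(first_four)                           # [1, 3, 6, 10]
```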
Stream Processing Engine Technology Selection
There are many stream processing engines on the market besides Flink, such as Storm, Spark Streaming, and Trident. How to choose in practice:
- If the stream job needs state management, choose Trident, Spark Streaming, or Flink
- If message delivery must guarantee exactly-once semantics, core Storm is not an option
- For small, independent projects with low latency requirements, Storm can be a simpler choice
- If the project already uses Spark as its main framework and Spark Streaming meets the real-time processing requirements, using Spark Streaming directly is recommended
- If exactly-once message delivery, large data volumes, high throughput, and low latency are all required, together with state management or window statistics, Flink is recommended
Architecture Components
JobManager
The JobManager is the core control component of a Flink cluster, responsible for managing the entire lifecycle of a data processing job. Its main responsibilities include:
- Task Scheduling: JobManager divides user-submitted jobs into multiple tasks and schedules these tasks to different TaskManagers for execution
- Resource Management: It interacts with resource management systems (like YARN or Kubernetes) to allocate and manage resources required for job execution
- Fault Recovery: When a task in the job fails, JobManager is responsible for rescheduling that task and recovering execution from failure point
- Checkpointing Coordination: JobManager coordinates Flink’s fault tolerance mechanism, ensuring job state consistency through managing Checkpointing
TaskManager
The TaskManager is the worker node of a Flink cluster, responsible for executing the specific tasks assigned by the JobManager. Its responsibilities include:
- Task Execution: the TaskManager accepts tasks assigned by the JobManager and executes them; each TaskManager can execute multiple task instances simultaneously
- State Management: in stateful stream processing applications, the TaskManager manages the local state of its tasks
- Data Transfer: the TaskManager transfers data between different tasks
Dispatcher
The Dispatcher is a relatively new component. Its main responsibility is to accept jobs submitted by clients and hand them to a JobManager in the cluster for processing. The Dispatcher also serves the Flink cluster's REST API.
ResourceManager
ResourceManager is responsible for interacting with cluster managers (like YARN, Kubernetes, Standalone, etc.), managing resources required for Flink jobs
Client
The Client is the entry point for users to interact with a Flink cluster. Users submit jobs to the Dispatcher through the client.
Flink Runtime
Flink Runtime is where Flink’s core data processing engine resides. It is responsible for processing data streams and executing user-defined operations
State Backend
The State Backend is the module Flink uses to store task state. The main options are:
- Memory/filesystem state backend: keeps working state on the TaskManager heap, suitable for small-scale jobs
- RocksDB state backend: stores state in an embedded RocksDB database, suitable for large-scale, stateful stream processing applications
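The idea of a pluggable state backend can be sketched as follows: the operator code talks to one state interface, while the storage behind it can vary. This is a conceptual illustration with invented class names, not Flink's state API.

```python
# Conceptual sketch of keyed state behind a pluggable backend interface.
# Invented names for illustration; not the actual Flink state API.

class InMemoryStateBackend:
    """Simplest backend: state lives in a Python dict (like heap state)."""
    def __init__(self):
        self._store = {}
    def put(self, key, value):
        self._store[key] = value
    def get(self, key, default=None):
        return self._store.get(key, default)

def count_by_key(events, backend):
    # Keyed state: one counter per key, kept in whatever backend was plugged in.
    for key in events:
        backend.put(key, backend.get(key, 0) + 1)
    return backend

backend = count_by_key(["a", "b", "a", "a"], InMemoryStateBackend())
print(backend.get("a"), backend.get("b"))  # 3 1
```

A RocksDB-style backend would implement the same `put`/`get` interface but spill state to disk, which is why it scales to state far larger than memory.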
Checkpointing and Savepoints
Flink provides two mechanisms for fault tolerance:
- Checkpointing: Periodically saves task state to distributed storage to ensure recovery from latest checkpoint when faults occur
- Savepoints: User-triggered state snapshots, can be used during program upgrade or redeployment
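The checkpoint-and-recover cycle can be sketched in miniature: snapshot operator state periodically, and on failure restore the snapshot and replay the input since that point. This toy omits everything that makes real checkpointing hard (barriers, distributed storage, source offsets), but shows why recovery lands in the same state as a failure-free run.

```python
# Sketch of checkpoint-and-restore for a single stateful operator.
import copy

class CountingOperator:
    def __init__(self):
        self.state = {"count": 0}
    def process(self, _event):
        self.state["count"] += 1
    def snapshot(self):
        return copy.deepcopy(self.state)     # take a checkpoint
    def restore(self, snap):
        self.state = copy.deepcopy(snap)     # recover from a checkpoint

op = CountingOperator()
for e in range(5):
    op.process(e)
chk = op.snapshot()          # checkpoint after 5 events
for e in range(3):
    op.process(e)            # 3 more events... then a simulated failure
op.restore(chk)              # recovery: roll back to the checkpointed state
for e in range(3):
    op.process(e)            # the source replays the 3 events
print(op.state["count"])     # 8 -- same result as if no failure had occurred
```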
DataStream and DataSet API
- DataStream API: Used for stream processing, supports unbounded and bounded data streams
- DataSet API: Used for batch processing, supports bounded dataset processing
Execution Graph
When a Flink job is submitted, it is transformed into an Execution Graph. Execution Graph describes tasks in the job and their dependencies
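The dependency structure of an execution graph can be sketched as a small DAG ordered topologically. The operator names are invented for illustration, and Flink's real ExecutionGraph additionally tracks parallel subtasks, slots, and execution state per task.

```python
# Toy execution graph: operators plus dependency edges, ordered so that
# every operator comes after the operators it depends on.
from graphlib import TopologicalSorter

edges = {           # node -> its predecessors (upstream operators)
    "sink":   {"window"},
    "window": {"keyBy"},
    "keyBy":  {"map"},
    "map":    {"source"},
    "source": set(),
}
order = list(TopologicalSorter(edges).static_order())
print(order)  # ['source', 'map', 'keyBy', 'window', 'sink']
```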