Flink Parallelism Deep Dive: From Concepts to Best Practices

Flink Parallelism In Detail

Basic Concept of Parallelism

In Apache Flink, parallelism is the number of parallel instances (subtasks) of an operator that run at the same time. A Flink program's dataflow graph consists of multiple operators (sources, transformations, and sinks), and the parallelism of each operator can be set independently.

Parallelism Setting Methods

1. Global Parallelism Setting:

env.setParallelism(4);

2. Operator Level Parallelism:

DataStream<String> data = env.addSource(new CustomSource())
    .map(new MyMapper()).setParallelism(2)
    .filter(new MyFilter()).setParallelism(3);

3. Configuration File Setting (flink-conf.yaml):

parallelism.default: 3

4. Client Submission Parameter:

./bin/flink run -p 5 -c com.example.MyJob myJob.jar

Parallelism Priority

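When parallelism is set at several levels, the higher-priority setting wins. From highest to lowest:

1. Operator level (setParallelism() on an individual operator)
2. Environment level (env.setParallelism())
3. Client level (-p parameter)
4. System default level (parallelism.default in flink-conf.yaml)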
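
The priority order can be modeled as a simple fallback chain. The sketch below is only an illustration of the rule, not Flink's internal code: the class ParallelismPrecedence and its resolve method are hypothetical names, and the values reuse the numbers from the snippets in this article (operator 2, environment 4, client 5, system default 3).

```java
import java.util.OptionalInt;

/** Illustrative model of the precedence rule; hypothetical, not part of Flink. */
class ParallelismPrecedence {

    /**
     * Returns the first explicitly set value, checked from highest to lowest
     * priority: operator, environment, client (-p), then the system default.
     */
    static int resolve(OptionalInt operatorLevel,
                       OptionalInt environmentLevel,
                       OptionalInt clientLevel,
                       int systemDefault) {
        if (operatorLevel.isPresent()) {
            return operatorLevel.getAsInt();
        }
        if (environmentLevel.isPresent()) {
            return environmentLevel.getAsInt();
        }
        if (clientLevel.isPresent()) {
            return clientLevel.getAsInt();
        }
        return systemDefault;
    }

    public static void main(String[] args) {
        OptionalInt unset = OptionalInt.empty();

        // Operator calls setParallelism(2): the operator setting wins.
        System.out.println(resolve(OptionalInt.of(2), OptionalInt.of(4), OptionalInt.of(5), 3)); // 2

        // Operator has no own setting: env.setParallelism(4) applies.
        System.out.println(resolve(unset, OptionalInt.of(4), OptionalInt.of(5), 3)); // 4

        // Only -p 5 was given on submission.
        System.out.println(resolve(unset, unset, OptionalInt.of(5), 3)); // 5

        // Nothing set anywhere: parallelism.default is used.
        System.out.println(resolve(unset, unset, unset, 3)); // 3
    }
}
```

In practice this means a value hard-coded with env.setParallelism() cannot be overridden by -p at submission time, which is one reason to leave the environment level unset in jobs that should be tunable from the command line.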

Relationship Between Parallelism and Slot

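A slot is a static resource unit: each TaskManager offers a fixed number of slots, which bounds how many subtasks it can run concurrently.
Parallelism is a dynamic execution concept: it is the concurrency a job actually uses at runtime.
The slots a job requires must not exceed the total number of slots in the cluster. With the default slot sharing, a job needs as many slots as its highest operator parallelism, not the sum over all operators.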
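
The slot requirement can be checked with a few lines of arithmetic. The sketch below assumes all operators stay in the default slot sharing group, so the job needs as many slots as its highest operator parallelism. The class SlotCheck, its methods, and the cluster size (2 TaskManagers with 2 slots each) are hypothetical values chosen for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative slot arithmetic; hypothetical helper, not a Flink API. */
class SlotCheck {

    /** Slots needed when every operator shares the default slot sharing group. */
    static int requiredSlots(Map<String, Integer> operatorParallelism) {
        return operatorParallelism.values().stream()
                .mapToInt(Integer::intValue)
                .max()
                .orElse(0);
    }

    /** Total slots offered by a cluster of identical TaskManagers. */
    static int availableSlots(int taskManagers, int slotsPerTaskManager) {
        return taskManagers * slotsPerTaskManager;
    }

    /** True if the cluster offers enough slots to run the job. */
    static boolean fits(Map<String, Integer> operatorParallelism,
                        int taskManagers, int slotsPerTaskManager) {
        return requiredSlots(operatorParallelism)
                <= availableSlots(taskManagers, slotsPerTaskManager);
    }

    public static void main(String[] args) {
        // Parallelism values reused from the earlier snippets in this article.
        Map<String, Integer> job = new LinkedHashMap<>();
        job.put("source", 4);
        job.put("map", 2);
        job.put("filter", 3);

        System.out.println(requiredSlots(job)); // 4, the maximum, not 4 + 2 + 3
        System.out.println(fits(job, 2, 2));    // true: 4 required, 4 available

        job.put("source", 5);
        System.out.println(fits(job, 2, 2));    // false: 5 required, 4 available
    }
}
```

If operators are placed in separate slot sharing groups, the requirement grows toward the sum of the groups' maxima, so this check would need to be adapted.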

Best Practices

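Size parallelism to the data volume: too low a value lets data back up, while too high a value wastes slots and adds scheduling and network overhead.
Allocate parallelism per operator according to its cost: give expensive operators more subtasks and keep lightweight ones lower.
Account for network and I/O limits: more subtasks mean more shuffle connections and more load on external systems such as sources and sinks.
Batch jobs can usually run with higher parallelism, while streaming jobs are better kept at a stable parallelism, since changing it typically means restarting the job and redistributing its state.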
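
One rough way to turn "based on data volume" into a number is to divide the expected peak input rate by the rate a single subtask can sustain, then cap the result at the slots available. This is a common rule of thumb rather than anything Flink prescribes; the class ParallelismSizing, the method suggest, and all rates below are hypothetical, and the per-subtask rate has to come from measuring the actual job.

```java
/** Back-of-the-envelope sizing heuristic; hypothetical helper, not a Flink API. */
class ParallelismSizing {

    /**
     * Suggests a parallelism of ceil(peakRate / perSubtaskRate),
     * at least 1 and at most the number of available slots.
     */
    static int suggest(long peakRecordsPerSecond,
                       long recordsPerSecondPerSubtask,
                       int availableSlots) {
        if (peakRecordsPerSecond < 0
                || recordsPerSecondPerSubtask <= 0
                || availableSlots <= 0) {
            throw new IllegalArgumentException("rates and slot count must be positive");
        }
        // Ceiling division without floating point.
        long needed = (peakRecordsPerSecond + recordsPerSecondPerSubtask - 1)
                / recordsPerSecondPerSubtask;
        long atLeastOne = Math.max(1L, needed);
        return (int) Math.min(atLeastOne, (long) availableSlots);
    }

    public static void main(String[] args) {
        // Assumed figures: 50,000 records/s at peak, 12,000 records/s per subtask.
        System.out.println(suggest(50_000, 12_000, 8)); // 5
        // Same load on a smaller cluster: capped at the 4 available slots.
        System.out.println(suggest(50_000, 12_000, 4)); // 4
    }
}
```

When the suggestion hits the slot cap, as in the second call, the job is likely to fall behind at peak load, which signals that the cluster rather than the parallelism setting needs to change.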