Basic Concept of Parallelism

In Apache Flink, parallelism is the number of parallel subtasks (instances) that an operator is split into at runtime. A Flink program's dataflow graph consists of multiple operators (sources, transformations, and sinks), and each operator can set its parallelism independently.

Parallelism Setting Methods

1. Global Parallelism Setting:

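StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(4); // default parallelism for every operator in this job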

2. Operator Level Parallelism:

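// setParallelism() applies to the operator directly before it; the source keeps the default.
DataStream<String> data = env.addSource(new CustomSource())
    .map(new MyMapper()).setParallelism(2)      // 2 parallel map subtasks
    .filter(new MyFilter()).setParallelism(3);  // 3 parallel filter subtasks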

3. Configuration File Setting (flink-conf.yaml, or config.yaml in newer Flink releases):

parallelism.default: 3

4. Client Submission Parameter:

./bin/flink run -p 5 -c com.example.MyJob myJob.jar

Parallelism Priority

  1. Operator level (setParallelism() on an operator; highest)
  2. Environment level (env.setParallelism())
  3. Client level (-p parameter)
  4. System default level (parallelism.default in flink-conf.yaml; lowest)
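
The precedence order above can be sketched as a simple fallback chain. This is only an illustrative model, not Flink code: ParallelismResolver and resolve are hypothetical names, and an empty OptionalInt stands for "not set at this level". The sample values reuse the numbers from the earlier examples (operator 2, environment 4, client 5, configuration file 3).

```java
import java.util.OptionalInt;

// Hypothetical model of the precedence rules; not part of the Flink API.
final class ParallelismResolver {

    /** Returns the value from the highest-priority level that is set. */
    static int resolve(OptionalInt operatorLevel,
                       OptionalInt environmentLevel,
                       OptionalInt clientLevel,
                       int systemDefault) {
        if (operatorLevel.isPresent()) {
            return operatorLevel.getAsInt();      // 1. operator level
        }
        if (environmentLevel.isPresent()) {
            return environmentLevel.getAsInt();   // 2. environment level
        }
        if (clientLevel.isPresent()) {
            return clientLevel.getAsInt();        // 3. client level (-p)
        }
        return systemDefault;                     // 4. system default
    }

    public static void main(String[] args) {
        OptionalInt unset = OptionalInt.empty();
        // All four levels set: the operator-level value is used.
        System.out.println(resolve(OptionalInt.of(2), OptionalInt.of(4), OptionalInt.of(5), 3));
        // No operator-level value: the environment-level value is used.
        System.out.println(resolve(unset, OptionalInt.of(4), OptionalInt.of(5), 3));
        // Only the client flag is given: the -p value is used.
        System.out.println(resolve(unset, unset, OptionalInt.of(5), 3));
        // Nothing set explicitly: the configured default applies.
        System.out.println(resolve(unset, unset, unset, 3));
    }
}
```

Each call drops one more level, so the result falls through from the operator setting down to the system default.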

Relationship Between Parallelism and Slot

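  • A slot is a static resource concept: each TaskManager offers a fixed number of slots (taskmanager.numberOfTaskSlots), which determines its concurrent execution capacity
  • Parallelism is a dynamic execution concept: the number of subtasks an operator actually uses while the program runs
  • With the default slot sharing, a job needs as many slots as its highest operator parallelism, so that value should not exceed the total number of slots in the cluster; otherwise the job cannot be fully scheduled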
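
The slot arithmetic can be sketched in a few lines of plain Java. This assumes Flink's default slot sharing, where one slot can hold one subtask of every operator, so the job needs as many slots as its highest operator parallelism rather than the sum. SlotCheck and its methods are hypothetical helpers, and the cluster size (2 TaskManagers with 3 slots each) is an invented example.

```java
import java.util.List;

// Hypothetical helper for reasoning about slots; not part of the Flink API.
final class SlotCheck {

    /** Total slots offered by the cluster. */
    static int totalSlots(int taskManagers, int slotsPerTaskManager) {
        return taskManagers * slotsPerTaskManager;
    }

    /** Slots a job needs under default slot sharing: the highest operator parallelism. */
    static int requiredSlots(List<Integer> operatorParallelisms) {
        int max = 0;
        for (int parallelism : operatorParallelisms) {
            max = Math.max(max, parallelism);
        }
        return max;
    }

    /** True if the cluster offers enough slots for the job. */
    static boolean fits(List<Integer> operatorParallelisms, int taskManagers, int slotsPerTaskManager) {
        return requiredSlots(operatorParallelisms) <= totalSlots(taskManagers, slotsPerTaskManager);
    }

    public static void main(String[] args) {
        // Source at 4, map at 2, filter at 3, as in the earlier examples.
        List<Integer> job = List.of(4, 2, 3);
        System.out.println("required slots: " + requiredSlots(job));
        System.out.println("cluster slots:  " + totalSlots(2, 3));
        System.out.println("fits:           " + fits(job, 2, 3));
    }
}
```

Here the job needs 4 slots and the cluster offers 6, so it fits. Raising any operator to a parallelism of 8 would exceed the 6 available slots.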

Best Practices

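  • Set parallelism based on data volume and the throughput a single subtask can sustain
  • Allocate parallelism per operator: give more to expensive operators and less to lightweight ones
  • Account for network and I/O limits, since higher parallelism increases data exchange between subtasks and load on external systems
  • Batch jobs can use higher parallelism; streaming jobs should keep parallelism stable, since changing it requires restarting the job and redistributing state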
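
As a rough starting point for the first practice, parallelism can be estimated by dividing the target throughput by what one subtask handles, then capping the result at the available slots. This is a rule-of-thumb sketch, not a Flink feature: ParallelismEstimate is a hypothetical helper, and the throughput figures are invented for illustration and would need to be measured for a real job.

```java
// Hypothetical sizing helper; a rule of thumb, not part of the Flink API.
final class ParallelismEstimate {

    /**
     * Estimates parallelism as ceil(target / perSubtask), capped at the available slots.
     * All rates are in records per second.
     */
    static int estimate(long targetRecordsPerSecond,
                        long recordsPerSecondPerSubtask,
                        int availableSlots) {
        if (targetRecordsPerSecond <= 0 || recordsPerSecondPerSubtask <= 0 || availableSlots <= 0) {
            throw new IllegalArgumentException("all arguments must be positive");
        }
        // Ceiling division: round up so the target rate is fully covered.
        long needed = (targetRecordsPerSecond + recordsPerSecondPerSubtask - 1)
                / recordsPerSecondPerSubtask;
        // Never ask for more subtasks than the cluster has slots.
        return (int) Math.min(needed, availableSlots);
    }

    public static void main(String[] args) {
        // Assumed figures: 50,000 records/s target, 6,000 records/s per subtask.
        System.out.println(estimate(50_000, 6_000, 12)); // enough slots: ceil(8.33) = 9
        System.out.println(estimate(50_000, 6_000, 6));  // capped by the 6 available slots
    }
}
```

The estimate is only a first guess; skewed keys, slow sinks, or network limits can make the effective per-subtask throughput lower than measured in isolation.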