This article mainly explains Flink State core concepts and practices. Flink divides computation into two types based on whether intermediate results need to be saved: stateful computation and stateless computation.
State Types:
- Stateful Computation: Needs to save intermediate results or state, depends on previous or subsequent events for data processing. Typical application scenarios include window aggregation, pattern detection, data deduplication, etc. State backend supports MemoryStateBackend, FsStateBackend, RocksDBStateBackend.
- Stateless Computation: Each data record is processed independently, no intermediate state needs to be saved. Typical operators include Map, Filter, FlatMap.
Flink State Classification:
- ValueState: Single value state, can update state value through update method, get state value through value() method
- ListState: State value is a list, can use add method to add values to the list
- ReducingState: Merges to a single state value through ReduceFunction
- MapState: State value is a Map, user adds elements through put and putAll methods
Keyed State vs Operator State:
- KeyedState: State associated with specific key, can only be applied to KeyedStream type datasets. Core features include key association, automatic partitioning, efficient access. Managed through KeyGroups mechanism, supports dynamic scaling and fault recovery.
- OperatorState: Bound to operator instances, does not depend on keys in data. Supports dynamic scaling up/down, provides three redistribution strategies: even distribution, full broadcast, custom distribution.
State Management Forms:
- Managed State: Unified management by Flink runtime environment, automatically converted to efficient memory data structures, provides exactly-once state guarantee through Checkpoint mechanism
- Raw State: Operator needs to maintain state data structures themselves, needs to implement state serialization logic
State Backend Comparison:
- MemoryStateBackend: Stored in JVM heap memory, fastest speed, suitable for testing/small state
- FsStateBackend: Stored in file system, suitable for conventional production environments, single task can reach TB level
- RocksDBStateBackend: Stored in RocksDB, suitable for super large state scenarios, only limited by disk capacity