Basic Introduction
Depending on the Spark and Kafka versions, there are two ways to integrate and process data:
- Receiver Approach
- Direct Approach
Kafka-08 Interface - Receiver-based Approach
The Receiver-based consumption method uses the high-level consumer API of the old Kafka consumer (Kafka 0.8.2.1 or later). The overall workflow of this approach is:
1. Data Reception and Storage
- Each Receiver runs as a long-running Task scheduled to an Executor
- Received Kafka data is first stored in Spark Executor memory, managed through BlockManager
- By default, the accumulated data is grouped into a Block every 200 ms (controlled by the spark.streaming.blockInterval parameter)
- These Blocks are replicated to other Executors for fault tolerance (default replication factor is 2)
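The relationship between the batch interval, blockInterval, and the number of Blocks (and hence RDD partitions) produced per batch can be sketched with plain arithmetic; the interval values below are illustrative:

```scala
// One Receiver emits one Block per blockInterval, so each batch from a single
// Receiver contains batchInterval / blockInterval Blocks -> RDD partitions.
val batchIntervalMs = 5000L  // e.g. StreamingContext(conf, Seconds(5))
val blockIntervalMs = 200L   // default spark.streaming.blockInterval
val blocksPerBatch  = batchIntervalMs / blockIntervalMs
println(s"RDD partitions per batch from one Receiver: $blocksPerBatch") // 25
```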
2. Data Processing Flow
- When Spark Streaming periodically generates Jobs, it constructs BlockRDD based on these Blocks
- Jobs over these RDDs are ultimately executed as regular Spark tasks
- Each Block corresponds to one RDD partition, so RDD partition count can be controlled by adjusting blockInterval
3. Key Features and Notes
Receiver Deployment Characteristics:
- Each Receiver runs as a long-running task on an Executor, continuously occupying one CPU core
- The number of Receivers is determined by the number of KafkaUtils.createStream() calls; each call creates one Receiver
- Parallel consumption across multiple Receivers can be achieved by calling createStream() multiple times and unioning the resulting streams
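The multi-Receiver pattern above can be sketched as follows (a minimal sketch against the Kafka 0.8 receiver API; the ZooKeeper address, group, topic names, and receiver count are assumptions):

```scala
// Sketch: several Receiver-based input streams unioned for parallel consumption
// (requires spark-streaming-kafka-0-8 on the classpath).
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("ReceiverDemo")
val ssc  = new StreamingContext(conf, Seconds(5))

val zkQuorum = "localhost:2181"        // assumption: local ZooKeeper
val group    = "demo-group"            // assumption: consumer group name
val topics   = Map("demo-topic" -> 4)  // topic -> consumer threads inside ONE Receiver

// Each createStream call starts one Receiver; create several and union them
val numReceivers = 3
val streams = (1 to numReceivers).map(_ => KafkaUtils.createStream(ssc, zkQuorum, group, topics))
val unioned = ssc.union(streams)
unioned.map(_._2).count().print()

ssc.start()
ssc.awaitTermination()
```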
Parallelism Limitations:
- The Kafka Topic partition count has no direct correlation with the Spark RDD partition count
- Increasing the Topic partition count only increases the number of consumer threads inside a single Receiver
- Actual Spark processing parallelism is still determined by the number of Blocks
- Example: a single Receiver consuming a 4-partition Topic still produces batchInterval / blockInterval RDD partitions per batch, regardless of the 4 Kafka partitions
Performance Considerations:
- Data locality issue: Tasks are preferentially scheduled on the Executor hosting the Receiver, which may cause cluster load imbalance
- The default blockInterval is 200 ms and can be adjusted based on data volume:
- Small data volume: the interval can be increased to reduce scheduling overhead
- Large data volume: the interval can be decreased to improve parallelism
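The tuning described above is done through SparkConf; the value below is illustrative:

```scala
// Sketch: adjusting blockInterval; a smaller interval yields more Blocks per
// batch and therefore more RDD partitions (100ms is an illustrative value)
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("BlockIntervalTuning")
  .set("spark.streaming.blockInterval", "100ms")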
Reliability Guarantee:
- Under the default configuration, the Receiver approach may lose data that has been received but not yet processed when a failure occurs
- A Write-Ahead Log (WAL) can be enabled to improve reliability:
- Set spark.streaming.receiver.writeAheadLog.enable=true (a checkpoint directory must also be configured)
- Data is first written to reliable storage such as HDFS
- This adds extra disk IO overhead and typically reduces throughput by roughly 10-20%
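Enabling the WAL can be sketched as follows (the checkpoint path, ZooKeeper address, and topic names are assumptions):

```scala
// Sketch: Receiver with a write-ahead log enabled
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf()
  .setAppName("ReceiverWithWAL")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(conf, Seconds(5))
// WAL data lives under the checkpoint directory, so one must be set
ssc.checkpoint("hdfs:///tmp/streaming-checkpoint") // assumption: HDFS path

// With the WAL providing durability, in-memory replication can be dropped
// by choosing a non-replicated storage level:
val stream = KafkaUtils.createStream(
  ssc, "localhost:2181", "demo-group", Map("demo-topic" -> 1),
  StorageLevel.MEMORY_AND_DISK_SER)
```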
4. Typical Application Scenarios:
- Suitable for moderate-throughput scenarios that are not latency-sensitive
- When compatibility with old Kafka 0.8.x versions is required
- Combined with WAL it provides at-least-once semantics (exactly-once requires the Direct approach together with idempotent or transactional output)
- Not suitable for scenarios requiring high throughput, low latency, or strict resource isolation
Kafka-08 Interface (Receiver Method)
- Offsets are saved in ZooKeeper and managed by the system
- Corresponding Kafka version: 0.8.2.1+
- The interface is implemented on top of the old Kafka high-level consumer API
- The underlying DStream implementation is BlockRDD
Kafka-08 Interface (Receiver with WAL)
- Enhanced fault-recovery capability
- Received data and Driver metadata are saved to HDFS
- Increases the streaming application's processing latency
Direct Approach
The Direct approach integrates Spark Streaming with Kafka without using a Receiver and is the one more commonly used in enterprise production environments. Compared to the Receiver approach, it has these characteristics:
- No Receiver is used, which frees the CPU core a Receiver would occupy and eliminates the whole path of the Receiver receiving data, writing it to the BlockManager, and later fetching it by BlockId (with network transmission and disk reads). This improves efficiency, and no WAL is needed, further reducing disk IO
- The RDD generated by the Direct approach is a KafkaRDD whose partition count matches the Kafka partition count, making parallelism easier to control. Note: after a Shuffle or Repartition operation this correspondence no longer holds
- Offsets can be maintained manually, enabling Exactly-Once semantics
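The manual-offset pattern above can be sketched with the Kafka 0.8 direct API (broker address, topic, and the offset-persistence step are assumptions; spark-streaming-kafka-0-8 must be on the classpath):

```scala
// Sketch of the Direct approach with manual offset tracking
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

val conf = new SparkConf().setAppName("DirectDemo")
val ssc  = new StreamingContext(conf, Seconds(5))

val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("demo-topic"))

stream.foreachRDD { rdd =>
  // This cast only works on the original KafkaRDD, before any shuffle/repartition
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process rdd, then store offsetRanges atomically with the results
  // (same transaction) to achieve exactly-once semantics
  offsetRanges.foreach(o => println(s"${o.topic}-${o.partition}: ${o.fromOffset} -> ${o.untilOffset}"))
}

ssc.start()
ssc.awaitTermination()
```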
Kafka-10 Interface
Spark Streaming's integration with Kafka 0.10 is similar to the Direct approach in version 0.8: Kafka partitions and Spark RDD partitions correspond one-to-one, offsets and metadata are accessible, and the API usage is not significantly different.
Add Dependency
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.12</artifactId>
    <version>${spark.version}</version>
</dependency>
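With that dependency in place, the 0.10 integration can be sketched as follows (broker address, group, and topic names are assumptions):

```scala
// Sketch: Direct stream against Kafka 0.10 with offsets committed back to Kafka
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val conf = new SparkConf().setAppName("Kafka010Demo")
val ssc  = new StreamingContext(conf, Seconds(5))

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "localhost:9092",      // assumption: local broker
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "demo-group",
  "auto.offset.reset"  -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean))

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("demo-topic"), kafkaParams))

stream.foreachRDD { rdd =>
  val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd.map(_.value).count()  // process the batch
  // commit offsets back to Kafka only after processing succeeds
  stream.asInstanceOf[CanCommitOffsets].commitAsync(ranges)
}

ssc.start()
ssc.awaitTermination()
```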