Basic Introduction
Depending on the Spark and Kafka versions, there are two ways to integrate and process data:
- Receiver Approach
- Direct Approach
Kafka-08 Interface - Receiver-based Approach
The Receiver-based consumption method uses the high-level consumer API of the old Kafka consumer (Kafka 0.8.2.1 or later). The overall workflow of this approach is:
1. Data Reception and Storage
- Each Receiver runs as a long-running Task scheduled to an Executor
- Received Kafka data is first stored in Spark Executor memory, managed through BlockManager
- By default, the accumulated data is grouped into a Block every 200 ms (controlled by the spark.streaming.blockInterval parameter)
- These Blocks are replicated to other Executors for fault tolerance (default replication factor is 2)
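The relationship between the batch interval, blockInterval, and the number of Blocks (and hence RDD partitions) produced per batch can be sketched with plain arithmetic; the interval values below are illustrative:

```scala
// One Receiver emits one Block per blockInterval, so each batch from a single
// Receiver contains batchInterval / blockInterval Blocks -> RDD partitions.
val batchIntervalMs = 5000L  // e.g. StreamingContext(conf, Seconds(5))
val blockIntervalMs = 200L   // default spark.streaming.blockInterval
val blocksPerBatch  = batchIntervalMs / blockIntervalMs
println(s"RDD partitions per batch from one Receiver: $blocksPerBatch") // 25
```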
2. Data Processing Flow
- When Spark Streaming periodically generates Jobs, it constructs BlockRDD based on these Blocks
- Jobs over these RDDs are ultimately executed as regular Spark tasks
- Each Block corresponds to one RDD partition, so RDD partition count can be controlled by adjusting blockInterval
3. Key Features and Notes
Receiver Deployment Characteristics:
- Each Receiver runs as a long-running task on an Executor, continuously occupying one CPU core
- The number of Receivers is determined by the number of KafkaUtils.createStream() calls; each call creates one Receiver
- Parallel consumption across multiple Receivers can be achieved by calling createStream() multiple times and unioning the resulting streams
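The multi-Receiver pattern above can be sketched as follows (a minimal sketch against the Kafka 0.8 receiver API; the ZooKeeper address, group, topic names, and receiver count are assumptions):

```scala
// Sketch: several Receiver-based input streams unioned for parallel consumption
// (requires spark-streaming-kafka-0-8 on the classpath).
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("ReceiverDemo")
val ssc  = new StreamingContext(conf, Seconds(5))

val zkQuorum = "localhost:2181"        // assumption: local ZooKeeper
val group    = "demo-group"            // assumption: consumer group name
val topics   = Map("demo-topic" -> 4)  // topic -> consumer threads inside ONE Receiver

// Each createStream call starts one Receiver; create several and union them
val numReceivers = 3
val streams = (1 to numReceivers).map(_ => KafkaUtils.createStream(ssc, zkQuorum, group, topics))
val unioned = ssc.union(streams)
unioned.map(_._2).count().print()

ssc.start()
ssc.awaitTermination()
```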
Parallelism Limitations:
- The Kafka Topic partition count has no direct correlation with the Spark RDD partition count
- Increasing the Topic partition count only increases the number of consumer threads inside a single Receiver
- Actual Spark processing parallelism is still determined by the number of Blocks
- Example: a single Receiver consuming a 4-partition Topic still produces batchInterval / blockInterval RDD partitions per batch, regardless of the 4 Kafka partitions
Performance Considerations:
- Data locality issue: Tasks are preferentially scheduled on the Executor hosting the Receiver, which may cause cluster load imbalance
- The default blockInterval is 200 ms and can be adjusted based on data volume:
- Small data volume: the interval can be increased to reduce scheduling overhead
- Large data volume: the interval can be decreased to improve parallelism
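The tuning described above is done through SparkConf; the value below is illustrative:

```scala
// Sketch: adjusting blockInterval; a smaller interval yields more Blocks per
// batch and therefore more RDD partitions (100ms is an illustrative value)
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("BlockIntervalTuning")
  .set("spark.streaming.blockInterval", "100ms")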
Reliability Guarantee:
- Under the default configuration, the Receiver approach may lose data that has been received but not yet processed when a failure occurs
- A Write-Ahead Log (WAL) can be enabled to improve reliability:
- Set spark.streaming.receiver.writeAheadLog.enable=true (a checkpoint directory must also be configured)
- Data is first written to reliable storage such as HDFS
- This adds extra disk IO overhead and typically reduces throughput by roughly 10-20%
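Enabling the WAL can be sketched as follows (the checkpoint path, ZooKeeper address, and topic names are assumptions):

```scala
// Sketch: Receiver with a write-ahead log enabled
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf()
  .setAppName("ReceiverWithWAL")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(conf, Seconds(5))
// WAL data lives under the checkpoint directory, so one must be set
ssc.checkpoint("hdfs:///tmp/streaming-checkpoint") // assumption: HDFS path

// With the WAL providing durability, in-memory replication can be dropped
// by choosing a non-replicated storage level:
val stream = KafkaUtils.createStream(
  ssc, "localhost:2181", "demo-group", Map("demo-topic" -> 1),
  StorageLevel.MEMORY_AND_DISK_SER)
```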
4. Typical Application Scenarios:
- Suitable for moderate-throughput scenarios that are not latency-sensitive
- When compatibility with old Kafka 0.8.x versions is required
- Combined with WAL it provides at-least-once semantics (exactly-once requires the Direct approach together with idempotent or transactional output)
- Not suitable for scenarios requiring high throughput, low latency, or strict resource isolation
Kafka-08 Interface (Receiver Method)
- Offsets are saved in ZooKeeper and managed by the system
- Corresponding Kafka version: 0.8.2.1+
- The interface is implemented on top of the old Kafka high-level consumer API
- The underlying DStream implementation is BlockRDD
Kafka-08 Interface (Receiver with WAL)
- Enhanced fault-recovery capability
- Received data and Driver metadata are saved to HDFS
- Increases the streaming application's processing latency
Direct Approach
The Direct approach integrates Spark Streaming with Kafka without using a Receiver and is the one more commonly used in enterprise production environments. Compared to the Receiver approach, it has these characteristics:
- No Receiver is used, which frees the CPU core a Receiver would occupy and eliminates the whole path of the Receiver receiving data, writing it to the BlockManager, and later fetching it by BlockId (with network transmission and disk reads). This improves efficiency, and no WAL is needed, further reducing disk IO
- The RDD generated by the Direct approach is a KafkaRDD whose partition count matches the Kafka partition count, making parallelism easier to control. Note: after a Shuffle or Repartition operation this correspondence no longer holds
- Offsets can be maintained manually, enabling Exactly-Once semantics
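The manual-offset pattern above can be sketched with the Kafka 0.8 direct API (broker address, topic, and the offset-persistence step are assumptions; spark-streaming-kafka-0-8 must be on the classpath):

```scala
// Sketch of the Direct approach with manual offset tracking
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

val conf = new SparkConf().setAppName("DirectDemo")
val ssc  = new StreamingContext(conf, Seconds(5))

val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("demo-topic"))

stream.foreachRDD { rdd =>
  // This cast only works on the original KafkaRDD, before any shuffle/repartition
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process rdd, then store offsetRanges atomically with the results
  // (same transaction) to achieve exactly-once semantics
  offsetRanges.foreach(o => println(s"${o.topic}-${o.partition}: ${o.fromOffset} -> ${o.untilOffset}"))
}

ssc.start()
ssc.awaitTermination()
```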
Kafka-10 Interface
Spark Streaming's integration with Kafka 0.10 is similar to the Direct approach in version 0.8: Kafka partitions and Spark RDD partitions correspond one-to-one, offsets and metadata are accessible, and the API usage is not significantly different.
Add Dependency
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.12</artifactId>
    <version>${spark.version}</version>
</dependency>
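With that dependency in place, the 0.10 integration can be sketched as follows (broker address, group, and topic names are assumptions):

```scala
// Sketch: Direct stream against Kafka 0.10 with offsets committed back to Kafka
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val conf = new SparkConf().setAppName("Kafka010Demo")
val ssc  = new StreamingContext(conf, Seconds(5))

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "localhost:9092",      // assumption: local broker
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "demo-group",
  "auto.offset.reset"  -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean))

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("demo-topic"), kafkaParams))

stream.foreachRDD { rdd =>
  val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd.map(_.value).count()  // process the batch
  // commit offsets back to Kafka only after processing succeeds
  stream.asInstanceOf[CanCommitOffsets].commitAsync(ranges)
}

ssc.start()
ssc.awaitTermination()
```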