This is article 85 in the Big Data series, introducing the architecture and evolution background of Spark’s two generations of streaming frameworks.
Real-Time Computing Background
For batch processing, Spark Core's RDDs and Spark SQL are very mature. But many business scenarios demand real-time processing:
- Real-time monitoring and alerting
- Log streaming analysis
- Real-time financial transaction risk control
- Social media sentiment analysis
Spark launched two generations of streaming frameworks to solve these problems: Spark Streaming (DStream) and Structured Streaming.
Generation 1: DStream Micro-Batch
Core Concepts
DStream (Discretized Stream) is the high-level abstraction of Spark Streaming. It represents a continuous data stream as a series of consecutive RDDs, each containing the data that arrived within a specific time interval.
Supported Data Sources
- External Sources: Kafka, Flume, HDFS/S3, Socket, Twitter API
- Internal Sources: Transform existing DStream to generate new DStream
Supported Operations
| Operation Type | Examples |
|---|---|
| Transformation | map, filter, reduceByKey, join |
| Output Operations | Write to file system, database, Dashboard |
| Window Operations | window, reduceByKeyAndWindow |
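As a hedged sketch of the window operations above, a sliding-window word count over a DStream might look like this (the socket source, host/port, and interval values are illustrative assumptions, not part of the original article):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// A socket source stands in for any DStream input
val conf = new SparkConf().setMaster("local[2]").setAppName("WindowDemo")
val ssc = new StreamingContext(conf, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)

// Count words over the last 30 seconds, recomputed every 10 seconds
val windowedCounts = lines
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))

windowedCounts.print()
```

Note that the window length and slide interval must both be multiples of the batch interval (here, 1 second).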
Micro-Batch Architecture
Data Stream (Kafka/Socket/...)
↓
Spark Streaming Receiver
↓
Split by Batch Interval
↓
RDD (time slice 1) → Process → Output
RDD (time slice 2) → Process → Output
RDD (time slice 3) → Process → Output
...
Batch Interval Configuration
Batch Interval is the core parameter of DStream, determining how much time each micro-batch covers:
| Scenario | Recommended Interval |
|---|---|
| Real-time monitoring/alerting | 500ms ~ 1s |
| Log analysis | 5s ~ 10s |
| Financial transaction processing | 1s ~ 2s |
Trade-off: Smaller intervals mean lower latency but higher system overhead; larger intervals mean higher throughput but increased latency.
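The trade-off shows up as a single constructor argument. A minimal sketch (the app name and interval choices are illustrative):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Milliseconds, Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("IntervalDemo")

// Low-latency alerting: small batch interval, more scheduling overhead
val lowLatencyCtx = new StreamingContext(conf, Milliseconds(500))

// Throughput-oriented log analysis: larger interval, higher latency
// (only one StreamingContext may be active per JVM, hence commented out)
// val logAnalysisCtx = new StreamingContext(conf, Seconds(10))
```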
DStream Code Example
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka010._
val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingDemo")
val ssc = new StreamingContext(conf, Seconds(1))
// Kafka connection parameters (adjust for your cluster)
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "streaming-demo"
)
val topics = Array("events")
// Read from Kafka
val kafkaStream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](topics, kafkaParams)
)
// Real-time WordCount
val wordCounts = kafkaStream
.flatMap(record => record.value().split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
DStream Advantages
- Exactly-Once Processing: Spark's internal transformations process each record exactly once; end-to-end delivery guarantees still depend on the source and sink (see limitations below)
- Seamless Integration with Batch: Both use RDD under the hood, low switching cost
- High Throughput: Can process millions of records per second
- Automatic Fault Tolerance: Automatic recovery on failure
DStream Limitations
As business complexity increased, DStream revealed three core problems:
1. EventTime Processing Difficulty
DStream can only window and aggregate by batch time (when data arrives at Spark); it cannot perform window calculations based on event time (when events actually occurred).
In real scenarios, network delays mean data may arrive late and out of order, and DStream has no graceful way to handle this.
2. Inconsistent Batch/Stream APIs
Batch processing uses SparkSQL/DataFrame API, stream processing uses DStream API—developers need to learn and maintain two codebases.
3. End-to-End Consistency Is the User's Responsibility
Achieving end-to-end exactly-once semantics requires users to manually manage Kafka offsets, implement idempotent writes, and so on, which is complex to get right.
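To illustrate the burden, here is a common DStream pattern, sketched against the `kafkaStream` from the example above: offsets are committed back to Kafka only after each batch's results have been written, and the idempotent-write step is left entirely to user code.

```scala
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

kafkaStream.foreachRDD { rdd =>
  // Capture the Kafka offset range covered by this batch
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // 1. Write results idempotently (e.g. upsert keyed by topic/partition/offset)
  //    -- user code, not shown

  // 2. Only after a successful write, commit the offsets back to Kafka
  kafkaStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
```

If the job fails between steps 1 and 2, the batch is replayed, so step 1 must tolerate duplicates; this is precisely the complexity Structured Streaming absorbs into the framework.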
Generation 2: Structured Streaming
Core Design Concept
Structured Streaming abstracts streaming data source as an Unbounded Table:
Kafka message stream → As unbounded table with continuously appended rows
SQL/DataFrame query → Operates on that unbounded table
Incremental computation → Writes results of new data to result table
This way, batch and stream processing can use the exact same SparkSQL/DataFrame API.
Improvements Over DStream
| Dimension | DStream | Structured Streaming |
|---|---|---|
| API | Separate DStream API | Unified DataFrame/SQL API |
| EventTime Support | Not supported | Native support with Watermark |
| End-to-End Exactly-Once | Manual implementation | Built-in framework guarantee |
| Query Optimization | None | Catalyst Optimizer |
| Latency Mode | Micro-batch | Micro-batch + Continuous Processing |
Structured Streaming Code Example
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
.appName("StructuredStreamingDemo")
.master("local[*]")
.getOrCreate()
// Read streaming from Kafka (almost same as batch API)
val df = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "events")
.load()
// Process (exact same DataFrame operations as batch)
import org.apache.spark.sql.functions.{col, explode, split}
val wordCounts = df
.selectExpr("CAST(value AS STRING) AS line")
.select(explode(split(col("line"), " ")).as("word"))  // split lines into words
.groupBy("word")
.count()
// Output to console (streaming write)
wordCounts.writeStream
.outputMode("complete")
.format("console")
.start()
.awaitTermination()
EventTime and Watermark
import org.apache.spark.sql.functions.{col, window}
// EventTime-based window aggregation, tolerating 10 minutes of late data
df.withWatermark("event_time", "10 minutes")
.groupBy(window(col("event_time"), "5 minutes"), col("user_id"))
.count()
Performance Tuning Suggestions
- Parallelism: match the Kafka topic's partition count to the Spark partition count
- Memory Management: increase executor memory (`--executor-memory`) to prevent OOM
- Backpressure: enable `spark.streaming.backpressure.enabled = true` (a Spark Streaming/DStream setting) to prevent data from piling up; Structured Streaming instead rate-limits sources, e.g. via `maxOffsetsPerTrigger` for Kafka
- Checkpoint: configure a checkpoint directory for fault recovery
// Enable backpressure (Spark Streaming / DStream)
spark.conf.set("spark.streaming.backpressure.enabled", "true")
// Configure a checkpoint location (required for most Structured Streaming sinks)
wordCounts.writeStream
.option("checkpointLocation", "/path/to/checkpoint")
.format("console")
.start()
Selection Suggestions
- New Projects: Prefer Structured Streaming—unified API, stronger features
- Existing DStream Projects: Evaluate migration cost, gradually migrate to Structured Streaming
- Ultra-Low Latency (< 100ms): Consider Structured Streaming’s Continuous Processing mode (still marked experimental)
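For the ultra-low-latency case, Continuous Processing is enabled via a trigger. A hedged sketch, reusing the `df` from the Structured Streaming example (the output topic and checkpoint path are illustrative placeholders):

```scala
import org.apache.spark.sql.streaming.Trigger

// Continuous Processing (experimental) targets ~1 ms end-to-end latency,
// but supports only map-like queries (selections/projections), not aggregations
df.selectExpr("CAST(value AS STRING) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "output")
  .option("checkpointLocation", "/path/to/checkpoint")
  .trigger(Trigger.Continuous("1 second"))  // checkpoint interval, not a batch interval
  .start()
```

Note the argument to `Trigger.Continuous` is how often progress is checkpointed; records flow continuously regardless.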
Structured Streaming represents the future direction of Spark stream processing. With version iterations, its features have matured—it’s the recommended choice for building production-grade real-time data pipelines.