This is article 85 in the Big Data series, introducing the architecture and evolution background of Spark’s two generations of streaming frameworks.
Real-Time Computing Background
For batch processing, Spark Core's RDDs and Spark SQL are very mature. But many business scenarios demand real-time processing:
- Real-time monitoring and alerting
- Log streaming analysis
- Real-time financial transaction risk control
- Social media sentiment analysis
Spark launched two generations of streaming frameworks to solve these problems: Spark Streaming (DStream) and Structured Streaming.
Generation 1: DStream Micro-Batch
Core Concepts
DStream (Discretized Stream) is the high-level abstraction of Spark Streaming. It represents a continuous data stream as a series of consecutive RDDs, each containing the data that arrived within a specific time interval.
Supported Data Sources
- External Sources: Kafka, Flume, HDFS/S3, Socket, Twitter API
- Internal Sources: Transform existing DStream to generate new DStream
Supported Operations
| Operation Type | Examples |
|---|---|
| Transformation | map, filter, reduceByKey, join |
| Output Operations | Write to file system, database, Dashboard |
| Window Operations | window, reduceByKeyAndWindow |
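As a hedged sketch of the window operations above, a sliding-window word count over a DStream might look like this (the socket source, host/port, and interval values are illustrative assumptions, not part of the original article):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// A socket source stands in for any DStream input
val conf = new SparkConf().setMaster("local[2]").setAppName("WindowDemo")
val ssc = new StreamingContext(conf, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)

// Count words over the last 30 seconds, recomputed every 10 seconds
val windowedCounts = lines
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))

windowedCounts.print()
```

Note that the window length and slide interval must both be multiples of the batch interval (here, 1 second).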
Micro-Batch Architecture
Data Stream (Kafka/Socket/...)
↓
Spark Streaming Receiver
↓
Split by Batch Interval
↓
RDD (time slice 1) → Process → Output
RDD (time slice 2) → Process → Output
RDD (time slice 3) → Process → Output
...
Batch Interval Configuration
Batch Interval is the core parameter of DStream, determining how much time each micro-batch covers:
| Scenario | Recommended Interval |
|---|---|
| Real-time monitoring/alerting | 500ms ~ 1s |
| Log analysis | 5s ~ 10s |
| Financial transaction processing | 1s ~ 2s |
Trade-off: Smaller intervals mean lower latency but higher system overhead; larger intervals mean higher throughput but increased latency.
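The trade-off shows up as a single constructor argument. A minimal sketch (the app name and interval choices are illustrative):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Milliseconds, Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("IntervalDemo")

// Low-latency alerting: small batch interval, more scheduling overhead
val lowLatencyCtx = new StreamingContext(conf, Milliseconds(500))

// Throughput-oriented log analysis: larger interval, higher latency
// (only one StreamingContext may be active per JVM, hence commented out)
// val logAnalysisCtx = new StreamingContext(conf, Seconds(10))
```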
DStream Code Example
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka010._
val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingDemo")
val ssc = new StreamingContext(conf, Seconds(1))
// Kafka connection parameters (adjust for your cluster)
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "streaming-demo"
)
val topics = Array("events")
// Read from Kafka
val kafkaStream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](topics, kafkaParams)
)
// Real-time WordCount
val wordCounts = kafkaStream
.flatMap(record => record.value().split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
DStream Advantages
- Exactly-Once Processing: Spark's internal transformations process each record exactly once; end-to-end delivery guarantees still depend on the source and sink (see limitations below)
- Seamless Integration with Batch: Both use RDD under the hood, low switching cost
- High Throughput: Can process millions of records per second
- Automatic Fault Tolerance: Automatic recovery on failure
DStream Limitations
As business complexity increased, DStream revealed three core problems:
1. EventTime Processing Difficulty
DStream can only window and aggregate by batch time (when data arrives at Spark); it cannot perform window calculations based on event time (when events actually occurred).
In real scenarios, network delays mean data may arrive late and out of order, and DStream has no graceful way to handle this.
2. Inconsistent Batch/Stream APIs
Batch processing uses SparkSQL/DataFrame API, stream processing uses DStream API—developers need to learn and maintain two codebases.
3. End-to-End Consistency Is the User's Responsibility
Achieving end-to-end exactly-once semantics requires users to manually manage Kafka offsets, implement idempotent writes, and so on, which is complex to get right.
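To illustrate the burden, here is a common DStream pattern, sketched against the `kafkaStream` from the example above: offsets are committed back to Kafka only after each batch's results have been written, and the idempotent-write step is left entirely to user code.

```scala
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

kafkaStream.foreachRDD { rdd =>
  // Capture the Kafka offset range covered by this batch
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // 1. Write results idempotently (e.g. upsert keyed by topic/partition/offset)
  //    -- user code, not shown

  // 2. Only after a successful write, commit the offsets back to Kafka
  kafkaStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
```

If the job fails between steps 1 and 2, the batch is replayed, so step 1 must tolerate duplicates; this is precisely the complexity Structured Streaming absorbs into the framework.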
Generation 2: Structured Streaming
Core Design Concept
Structured Streaming abstracts streaming data source as an Unbounded Table:
Kafka message stream → As unbounded table with continuously appended rows
SQL/DataFrame query → Operates on that unbounded table
Incremental computation → Writes results of new data to result table
This way, batch and stream processing can use the exact same SparkSQL/DataFrame API.
Improvements Over DStream
| Dimension | DStream | Structured Streaming |
|---|---|---|
| API | Separate DStream API | Unified DataFrame/SQL API |
| EventTime Support | Not supported | Native support with Watermark |
| End-to-End Exactly-Once | Manual implementation | Built-in framework guarantee |
| Query Optimization | None | Catalyst Optimizer |
| Latency Mode | Micro-batch | Micro-batch + Continuous Processing |
Structured Streaming Code Example
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
.appName("StructuredStreamingDemo")
.master("local[*]")
.getOrCreate()
// Read streaming from Kafka (almost same as batch API)
val df = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "events")
.load()
// Process (exact same DataFrame operations as batch)
import org.apache.spark.sql.functions.{col, explode, split}
val wordCounts = df
.selectExpr("CAST(value AS STRING) AS line")
.select(explode(split(col("line"), " ")).as("word"))  // split lines into words
.groupBy("word")
.count()
// Output to console (streaming write)
wordCounts.writeStream
.outputMode("complete")
.format("console")
.start()
.awaitTermination()
EventTime and Watermark
import org.apache.spark.sql.functions.{col, window}
// EventTime-based window aggregation, tolerating 10 minutes of late data
df.withWatermark("event_time", "10 minutes")
.groupBy(window(col("event_time"), "5 minutes"), col("user_id"))
.count()
Performance Tuning Suggestions
- Parallelism: match the Kafka topic's partition count to the Spark partition count
- Memory Management: increase executor memory (`--executor-memory`) to prevent OOM
- Backpressure: enable `spark.streaming.backpressure.enabled = true` (a Spark Streaming/DStream setting) to prevent data from piling up; Structured Streaming instead rate-limits sources, e.g. via `maxOffsetsPerTrigger` for Kafka
- Checkpoint: configure a checkpoint directory for fault recovery
// Enable backpressure (Spark Streaming / DStream)
spark.conf.set("spark.streaming.backpressure.enabled", "true")
// Configure a checkpoint location (required for most Structured Streaming sinks)
wordCounts.writeStream
.option("checkpointLocation", "/path/to/checkpoint")
.format("console")
.start()
Selection Suggestions
- New Projects: Prefer Structured Streaming—unified API, stronger features
- Existing DStream Projects: Evaluate migration cost, gradually migrate to Structured Streaming
- Ultra-Low Latency (< 100ms): Consider Structured Streaming’s Continuous Processing mode (still marked experimental)
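For the ultra-low-latency case, Continuous Processing is enabled via a trigger. A hedged sketch, reusing the `df` from the Structured Streaming example (the output topic and checkpoint path are illustrative placeholders):

```scala
import org.apache.spark.sql.streaming.Trigger

// Continuous Processing (experimental) targets ~1 ms end-to-end latency,
// but supports only map-like queries (selections/projections), not aggregations
df.selectExpr("CAST(value AS STRING) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "output")
  .option("checkpointLocation", "/path/to/checkpoint")
  .trigger(Trigger.Continuous("1 second"))  // checkpoint interval, not a batch interval
  .start()
```

Note the argument to `Trigger.Continuous` is how often progress is checkpointed; records flow continuously regardless.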
Structured Streaming represents the future direction of Spark stream processing. With version iterations, its features have matured—it’s the recommended choice for building production-grade real-time data pipelines.