This is article 85 in the Big Data series. It introduces the architecture of Spark’s two generations of streaming frameworks and the background of their evolution.


Real-Time Computing Background

For batch processing, Spark Core’s RDD API and Spark SQL are already very mature, but many business scenarios demand real-time processing:

  • Real-time monitoring and alerting
  • Log streaming analysis
  • Real-time financial transaction risk control
  • Social media sentiment analysis

Spark launched two generations of streaming frameworks to solve these problems: Spark Streaming (DStream) and Structured Streaming.


Generation 1: DStream Micro-Batch

Core Concepts

DStream (Discretized Stream) is the high-level abstraction of Spark Streaming. It represents a continuous data stream as a series of consecutive RDDs, each containing the data that arrived within a specific time interval.

Supported Data Sources

  • External Sources: Kafka, Flume, HDFS/S3, Socket, Twitter API
  • Internal Sources: Transform existing DStream to generate new DStream

Supported Operations

Operation Type       Examples
Transformation       map, filter, reduceByKey, join
Output Operations    write to file system, database, dashboard
Window Operations    window, reduceByKeyAndWindow
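
A minimal sketch combining one operation of each type (the local socket source on port 9999 is only an assumption for illustration):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("OperationsDemo")
val ssc = new StreamingContext(conf, Seconds(5))

// Transformations: derive new DStreams from the input DStream
val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))

// Window operation: word counts over the last 30 seconds, recomputed every 10 seconds
val windowedCounts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

// Output operation: print the first few results of each batch
windowedCounts.print()

ssc.start()
ssc.awaitTermination()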

Micro-Batch Architecture

Data Stream (Kafka/Socket/...)
            ↓
Spark Streaming Receiver
            ↓
Split by Batch Interval
            ↓
RDD (time slice 1) → Process → Output
RDD (time slice 2) → Process → Output
RDD (time slice 3) → Process → Output
          ...

Batch Interval Configuration

Batch Interval is the core parameter of DStream; it determines how much time each micro-batch covers:

Scenario                            Recommended Interval
Real-time monitoring/alerting       500 ms ~ 1 s
Log analysis                        5 s ~ 10 s
Financial transaction processing    1 s ~ 2 s

Trade-off: Smaller intervals mean lower latency but higher system overhead; larger intervals mean higher throughput but increased latency.
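
In the API, the interval is simply the second argument of StreamingContext. A sketch (only one StreamingContext can be active per JVM, hence the alternative is commented out):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Milliseconds, Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("IntervalDemo")

// Low-latency monitoring/alerting: 500 ms batches
val ssc = new StreamingContext(conf, Milliseconds(500))

// Throughput-oriented log analysis would instead use something like:
// val ssc = new StreamingContext(conf, Seconds(10))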

DStream Code Example

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka010._

val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingDemo")
val ssc = new StreamingContext(conf, Seconds(1))

// Kafka consumer configuration and topics to subscribe to (placeholder values)
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "streaming-demo"
)
val topics = Array("events")

// Read from Kafka
val kafkaStream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](topics, kafkaParams)
)

// Real-time WordCount
val wordCounts = kafkaStream
  .flatMap(record => record.value().split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

wordCounts.print()

ssc.start()
ssc.awaitTermination()

DStream Advantages

  • Exactly-Once Semantics: each record is processed exactly once inside the engine, ensuring accurate results (end-to-end guarantees still depend on the sink; see the limitations below)
  • Seamless Integration with Batch: both run on RDDs under the hood, so the cost of switching between batch and streaming code is low
  • High Throughput: can process millions of records per second
  • Automatic Fault Tolerance: automatic recovery on failure, as sketched after this list
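
A minimal sketch of what that recovery setup looks like with checkpointing (the checkpoint directory and socket source are assumptions for illustration):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Any reliable path works (HDFS/S3 in production, local only for testing)
val checkpointDir = "/tmp/streaming-checkpoint"

def createContext(): StreamingContext = {
  val conf = new SparkConf().setMaster("local[2]").setAppName("FaultTolerantDemo")
  val ssc = new StreamingContext(conf, Seconds(1))
  ssc.checkpoint(checkpointDir)  // enables metadata and state checkpointing
  // Sources and transformations must be defined inside this function
  val lines = ssc.socketTextStream("localhost", 9999)
  lines.count().print()
  ssc
}

// Fresh start: builds a new context; after a crash: rebuilds it from the checkpoint data
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()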

DStream Limitations

As business complexity increased, DStream revealed three core problems:

1. EventTime Processing Difficulty

DStream can only window by BatchTime, i.e. processing time (when the data arrives at Spark); it cannot perform window calculations based on EventTime (when the events actually occurred).

In real scenarios, network delays mean data can arrive late and out of order: a click generated at 10:00 that only reaches Spark at 10:07 is counted in the 10:05 to 10:10 window instead of the 10:00 to 10:05 window it belongs to. DStream has no graceful way to handle this.

2. Inconsistent Batch/Stream APIs

Batch jobs use the Spark SQL/DataFrame API while streaming jobs use the DStream API, so developers have to learn and maintain two separate sets of code.

3. End-to-End Consistency Is the User’s Responsibility

End-to-end exactly-once semantics require users to manually manage Kafka offsets, implement idempotent writes, and so on, all of which is complex to get right.
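
As an illustration of that manual work, here is a sketch using the kafka010 direct stream from the example above (the sink write is only indicated by a comment):

import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

kafkaStream.foreachRDD { rdd =>
  // Capture this batch's offset ranges before any shuffle loses the partition mapping
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // ... write the batch to the sink idempotently or inside a transaction ...

  // Commit offsets back to Kafka only after the write has succeeded
  kafkaStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}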


Generation 2: Structured Streaming

Core Design Concept

Structured Streaming abstracts a streaming data source as an unbounded table:

Kafka message stream → treated as an unbounded table whose rows are continuously appended
SQL/DataFrame query → operates on that unbounded table
Incremental computation → only newly arrived rows are processed, and their results are written to the result table

This way, batch and stream processing can use the exact same Spark SQL/DataFrame API.
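
A sketch of the idea (the path and schema here are assumptions): the same transformation function runs unchanged on a bounded and on an unbounded DataFrame; only the read call differs.

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{StringType, StructType}

val spark = SparkSession.builder().appName("UnifiedApiDemo").master("local[*]").getOrCreate()
val schema = new StructType().add("user_id", StringType).add("action", StringType)

// Business logic written once against the DataFrame API
def actionCounts(events: DataFrame): DataFrame = events.groupBy("action").count()

// Batch: the directory is read as a bounded table
val batchResult = actionCounts(spark.read.schema(schema).json("/data/events"))

// Streaming: the same directory is treated as an unbounded table of appended rows
val streamResult = actionCounts(spark.readStream.schema(schema).json("/data/events"))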

Improvements Over DStream

Dimension                  DStream                  Structured Streaming
API                        Separate DStream API     Unified DataFrame/SQL API
EventTime Support          Not supported            Native support with Watermark
End-to-End Exactly-Once    Manual implementation    Built-in framework guarantee
Query Optimization         None                     Catalyst Optimizer
Latency Mode               Micro-batch              Micro-batch + Continuous Processing

Structured Streaming Code Example

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("StructuredStreamingDemo")
  .master("local[*]")
  .getOrCreate()

// Read streaming data from Kafka (almost the same as the batch API)
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .load()

// Process (the exact same DataFrame operations as batch): split each message into words
val wordCounts = df
  .selectExpr("explode(split(CAST(value AS STRING), ' ')) AS word")
  .groupBy("word")
  .count()

// Output to console (streaming write)
wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
  .awaitTermination()

EventTime and Watermark

A watermark tells the engine how long to wait for late events before a window's result is finalized; events arriving later than the watermark allows are dropped. The snippet assumes df carries event_time and user_id columns.

import org.apache.spark.sql.functions.window
import spark.implicits._  // enables the $"column" syntax

// EventTime-based window aggregation, tolerating up to 10 minutes of data delay
df.withWatermark("event_time", "10 minutes")
  .groupBy(window($"event_time", "5 minutes"), $"user_id")
  .count()

Performance Tuning Suggestions

  1. Parallelism: Match the Kafka topic's partition count to the desired Spark parallelism; each Kafka partition maps to one Spark partition
  2. Memory Management: Increase executor memory (e.g. via --executor-memory) to prevent OOM
  3. Backpressure: For DStream jobs, enable spark.streaming.backpressure.enabled = true so the ingest rate adapts to processing speed; Structured Streaming instead rate-limits the Kafka source with maxOffsetsPerTrigger
  4. Checkpoint: Configure a checkpoint directory so the job can recover after failures

// Enable backpressure for Spark Streaming (DStream): set it on the SparkConf
// before the StreamingContext is created, e.g. on the conf from the example above
conf.set("spark.streaming.backpressure.enabled", "true")

// Configure a checkpoint directory so a Structured Streaming query can recover after failures
wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .option("checkpointLocation", "/path/to/checkpoint")
  .start()
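
Sketches for the first two suggestions (values and paths are placeholders); maxOffsetsPerTrigger is the Structured Streaming counterpart to DStream backpressure, capping how much each micro-batch reads from Kafka:

// Rate-limit the Kafka source in Structured Streaming
val limited = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .option("maxOffsetsPerTrigger", "10000")  // max offsets consumed per micro-batch
  .load()

// Parallelism: repartition if the topic has fewer partitions than the available cores
val repartitioned = limited.repartition(8)

// Memory: executor memory is set at submit time, for example:
//   spark-submit --executor-memory 4g --class com.example.StreamingJob app.jar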

Selection Suggestions

  • New Projects: prefer Structured Streaming for its unified API and richer feature set
  • Existing DStream Projects: evaluate the migration cost and migrate to Structured Streaming gradually
  • Ultra-Low Latency (< 100 ms): consider Structured Streaming’s Continuous Processing mode (still experimental), sketched below
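
A sketch of enabling Continuous Processing on the Kafka stream df from the earlier example; the mode supports only map-like operations (no aggregations):

import org.apache.spark.sql.streaming.Trigger

// Continuous trigger: records are processed as they arrive; "1 second" is the
// interval at which progress is checkpointed, not a batch interval
df.selectExpr("CAST(value AS STRING) AS line")
  .writeStream
  .format("console")
  .option("checkpointLocation", "/path/to/continuous-checkpoint")
  .trigger(Trigger.Continuous("1 second"))
  .start()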

Structured Streaming represents the future direction of Spark stream processing. Its feature set has matured over successive releases, and it is the recommended choice for building production-grade real-time data pipelines.