I. From Batch Processing to In-Memory Computing

Batch Processing Era (2006-2012)

  • Hadoop MapReduce framework dominated
  • Typical scenarios: overnight ETL, log analysis, data cleaning
  • Performance bottlenecks: intermediate results must be written to disk between stages; high task-scheduling overhead
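The model behind those scenarios can be sketched in a few lines. This is a hypothetical, in-process mini word count showing the three MapReduce phases; in real Hadoop MapReduce the intermediate (key, value) pairs between map and reduce are spilled to disk, which is exactly the bottleneck noted above.

```python
from collections import defaultdict

# map emits (word, 1) pairs
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word, 1)

# shuffle groups values by key (on disk in real MapReduce)
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# reduce aggregates each key's values
def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

logs = ["error warn error", "warn info"]
counts = reduce_phase(shuffle(map_phase(logs)))
# counts == {"error": 2, "warn": 2, "info": 1}
```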

Spark Revolution (2013)

  • Innovations: RDD-based in-memory computing, DAG execution planning, multi-language APIs
  • Performance improvement over MapReduce:
    • Iterative algorithms: up to 100x faster (data cached in memory)
    • Interactive queries: 10-100x faster
    • Batch jobs: 10-30x faster
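The two ideas behind those speedups, lazy DAG construction and in-memory caching, can be illustrated with a hypothetical toy class (not Spark's API): transformations only record work, an action triggers execution, and `cache()` keeps a result in memory so iterative jobs skip recomputation.

```python
# Minimal sketch of the RDD idea, assuming a single-process toy runtime.
class MiniRDD:
    def __init__(self, compute):
        self._compute = compute      # deferred computation (a DAG node)
        self._cache = None

    @staticmethod
    def from_list(data):
        return MiniRDD(lambda: list(data))

    def map(self, f):                # transformation: lazy, builds the DAG
        return MiniRDD(lambda: [f(x) for x in self.collect()])

    def filter(self, pred):          # transformation: lazy, builds the DAG
        return MiniRDD(lambda: [x for x in self.collect() if pred(x)])

    def cache(self):                 # materialize once, reuse in memory
        self._cache = self._compute()
        return self

    def collect(self):               # action: triggers actual execution
        return self._cache if self._cache is not None else self._compute()

evens = MiniRDD.from_list(range(10)).filter(lambda x: x % 2 == 0).cache()
total = sum(evens.map(lambda x: x * x).collect())
# total == 0 + 4 + 16 + 36 + 64 == 120
```

Each further iteration over `evens` reads the cached list instead of re-running the filter, which is the mechanism behind the iterative-algorithm speedup.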

Ecosystem Evolution

  • Cloudera Impala (2013): one of the first open-source MPP SQL engines for Hadoop
  • Facebook Presto (2013): distributed SQL engine with federated queries across multiple data sources
  • Apache Drill (2015): schema-free SQL over semi-structured data (e.g., JSON)

II. From Offline to Real-time Computing

Offline Computing Era

  • T+1 mode: process previous day’s data on the current day
  • Applicable: daily reports, historical analysis, ML model training
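The T+1 pattern is simple to show concretely. This hypothetical daily-report job selects only the previous day's partition (keyed by a `dt` field, an assumed schema) and aggregates it:

```python
from datetime import date, timedelta

# T+1: a job running "today" processes only yesterday's partition.
def previous_day_records(records, today):
    target = (today - timedelta(days=1)).isoformat()
    return [r for r in records if r["dt"] == target]

records = [
    {"dt": "2024-05-01", "amount": 30},
    {"dt": "2024-05-02", "amount": 50},
    {"dt": "2024-05-02", "amount": 20},
]
daily_total = sum(r["amount"]
                  for r in previous_day_records(records, date(2024, 5, 3)))
# daily_total == 70  (only the 2024-05-02 partition is read)
```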

Rise of Real-time Stream Processing

  • Apache Storm: Sub-second latency; the core API guarantees “at-least-once” semantics via tuple acking (exactly-once only through the Trident layer)
  • Lambda architecture: Batch layer + speed layer + serving layer (requires maintaining the same logic in two codebases)
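The serving layer's job can be sketched in a few lines. In this hypothetical example, the batch layer has precomputed page counts up to the last batch run, the speed layer holds increments since then, and a query merges both views; note the counting logic effectively lives in two places, which is the dual-codebase drawback noted above.

```python
# Hypothetical Lambda-architecture serving layer.
batch_view = {"page_a": 1000, "page_b": 250}   # from the nightly batch job
speed_view = {"page_a": 17, "page_c": 3}       # from the stream processor

def query(page):
    # Merge the precomputed batch view with real-time increments.
    return batch_view.get(page, 0) + speed_view.get(page, 0)

# query("page_a") == 1017  -> batch count plus real-time increments
# query("page_c") == 3     -> page seen only since the last batch run
```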

Next-Generation Stream Processing

  Technology        Characteristics                  Latency
  ----------------  -------------------------------  ------------
  Spark Streaming   Micro-batching, exactly-once     Seconds
  Apache Flink      Native event-driven              Milliseconds
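The two processing models in the table differ in when a record is handled. This hypothetical sketch (not either engine's API) shows why: a micro-batch engine groups records into fixed-size batches before processing, so latency is at least one batch interval, while an event-driven engine handles each record as it arrives.

```python
import itertools

# Micro-batch model: records wait until a batch fills (or an interval ends).
def micro_batches(events, batch_size):
    it = iter(events)
    while batch := list(itertools.islice(it, batch_size)):
        yield batch                  # the whole batch is processed together

# Event-driven model: each record is handled the moment it arrives.
def event_driven(events, handle):
    for e in events:
        handle(e)

events = list(range(7))
batches = list(micro_batches(events, 3))
# batches == [[0, 1, 2], [3, 4, 5], [6]]

seen = []
event_driven(events, seen.append)
# seen == [0, 1, 2, 3, 4, 5, 6]
```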