I. From Batch Processing to In-Memory Computing
Batch Processing Era (2006-2012)
- MapReduce framework dominated
- Typical scenarios: overnight ETL, log analysis, data cleaning
- Performance bottlenecks: intermediate results require disk writes, high task scheduling overhead
Spark Revolution (2013)
- Innovation: RDD in-memory computing, DAG execution plan, multi-language support
- Performance improvement over Hadoop MapReduce:
  - Iterative algorithms: up to 100x faster (when working sets fit in memory)
  - Interactive queries: 10-100x faster
  - Batch jobs: 10-30x faster
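The iterative-algorithm speedup above comes down to where intermediate results live. A toy illustration (plain Python, not Spark code; the disk-I/O counter is a stand-in for real storage round-trips): a MapReduce-style engine writes each iteration's output to disk and reads it back, while Spark keeps it cached in memory as an RDD.

```python
# Toy model of iterative computation: same arithmetic either way,
# but the MapReduce-style run pays storage I/O on every iteration.
disk_io_ops = 0

def run_iterations(data, n_iters, spill_to_disk):
    """Apply one 'map' stage per iteration; optionally round-trip via disk."""
    global disk_io_ops
    for _ in range(n_iters):
        data = [x * 2 for x in data]      # one transformation stage
        if spill_to_disk:
            disk_io_ops += 2              # write the result, then read it back
    return data

# MapReduce-style: every iteration goes through storage.
mr_result = run_iterations([1, 2, 3], 10, spill_to_disk=True)
# Spark-style: intermediate data stays cached in memory.
spark_result = run_iterations([1, 2, 3], 10, spill_to_disk=False)

assert mr_result == spark_result          # identical answers...
print(disk_io_ops)                        # ...but 20 avoidable I/O operations
```

The answers are identical; only the I/O bill differs, which is why the gap widens with the number of iterations.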
Ecosystem Evolution
- Cloudera Impala (2013): One of the first open-source MPP SQL-on-Hadoop engines
- Facebook Presto (2013): Distributed SQL engine that federates queries across multiple data sources
- Apache Drill (2015): Schema-free SQL engine for semi-structured data (e.g., JSON)
II. From Offline to Real-time Computing
Offline Computing Era
- T+1 mode: process previous day’s data on the current day
- Applicable: daily reports, historical analysis, ML model training
Rise of Real-time Stream Processing
- Apache Storm: Sub-second latency; guarantees at-least-once delivery by default (exactly-once only via the Trident API)
- Lambda architecture: Batch layer + speed layer + serving layer (main drawback: the same logic must be maintained in two codebases)
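The Lambda architecture's query path can be sketched in a few lines (a minimal illustration; the function and view names are invented here, not from any framework): the batch layer supplies complete-but-stale results, the speed layer covers events since the last batch run, and the serving layer merges both at query time.

```python
# Minimal Lambda-architecture serving layer: merge the authoritative
# batch view with the incremental real-time view at query time.
def serve_query(key, batch_view, realtime_view):
    """Return batch result plus whatever the speed layer has seen since."""
    return batch_view.get(key, 0) + realtime_view.get(key, 0)

batch_view = {"page_a": 1000, "page_b": 500}   # recomputed nightly (T+1)
realtime_view = {"page_a": 42}                 # events since the last batch run

assert serve_query("page_a", batch_view, realtime_view) == 1042
assert serve_query("page_b", batch_view, realtime_view) == 500
```

Note that the count logic effectively exists twice, once in the batch pipeline and once in the streaming pipeline, which is exactly the dual-codebase burden cited above.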
Next-Generation Stream Processing
| Technology | Characteristics | Latency |
|---|---|---|
| Spark Streaming | Micro-batching, exactly-once | Seconds |
| Apache Flink | Native event-at-a-time streaming, exactly-once | Milliseconds |
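The latency gap in the table follows from the processing model. A toy contrast in plain Python (illustrative only, not the Spark Streaming or Flink APIs): a micro-batcher must wait for a batch to fill before processing, while an event-driven engine handles each record as it arrives.

```python
# Micro-batching: group events into fixed-size batches before processing.
def micro_batch(stream, batch_size, process):
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:   # latency ~= time to fill a batch
            process(batch)
            batch = []
    if batch:
        process(batch)                 # flush the final partial batch

# Event-driven: handle each record immediately, so latency is per-event.
def event_driven(stream, process):
    for event in stream:
        process([event])

batches = []
micro_batch(range(7), 3, batches.append)
print(batches)   # [[0, 1, 2], [3, 4, 5], [6]]
```

Real engines batch by time interval rather than count, but the trade-off is the same: larger batches amortize overhead, per-event handling minimizes latency.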