This is article 67 in the Big Data series. It systematically reviews the evolution of three generations of big data computing engines, with a focus on Spark's design philosophy and core components.
Three Generations of Computing Engines
Big data processing technology has gone through three main stages:
| Generation | Representative Tech | Positioning |
|---|---|---|
| First | MapReduce | Batch processing, disk-driven |
| Second | Spark | In-memory computing, unified batch/stream |
| Third | Flink | True stream processing, event-driven |
Spark was born in 2009 at UC Berkeley's AMPLab, became an Apache top-level project in 2014, and currently dominates production environments in China.
MapReduce Limitations
MapReduce pioneered distributed batch processing, but as business complexity grew, its shortcomings gradually emerged:
- High disk I/O overhead: each MapReduce job's intermediate results must be written to HDFS, so multi-step ETL pipelines incur frequent disk reads and writes
- Limited expressiveness: only two phases, Map and Reduce; complex logic requires multiple Jobs chained in series
- High latency: every Job must re-apply for resources and start JVMs, and the resulting minute-level latency cannot meet interactive query needs
- No stream processing support: the model is inherently batch-oriented and cannot process real-time data streams
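The chaining cost in the first point can be sketched in a few lines of plain Python (a toy model: the `mr_job` helper and the dict standing in for HDFS are illustrative, not Hadoop's API):

```python
# Toy sketch of multi-job chaining in MapReduce: each "job" must write its
# output to a simulated HDFS before the next job can read it, so every
# extra step adds a full read/write round trip. Illustrative only.
hdfs = {}                              # stands in for the distributed filesystem
io_ops = {"reads": 0, "writes": 0}

def mr_job(read_path, write_path, mapper, reducer):
    io_ops["reads"] += 1               # job input: read from "HDFS"
    data = hdfs[read_path]
    mapped = [mapper(x) for x in data]
    result = reducer(mapped)
    io_ops["writes"] += 1              # job output: materialize to "HDFS"
    hdfs[write_path] = result
    return result

hdfs["/input"] = [1, 2, 3, 4]
# Step 1 squares each value; step 2 keeps the even ones.
# Each step is a separate job with its own disk round trip.
mr_job("/input", "/tmp/step1", mapper=lambda x: x * x, reducer=lambda xs: xs)
mr_job("/tmp/step1", "/output", mapper=lambda x: x,
       reducer=lambda xs: [x for x in xs if x % 2 == 0])
print(hdfs["/output"], io_ops)  # [4, 16], with 2 reads and 2 writes
```

A ten-step pipeline would pay this materialization cost ten times, which is exactly what Spark's in-memory model avoids.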
Spark Core Advantages
In-Memory Computing
Spark's core innovation is keeping intermediate computation results in memory rather than writing them to disk at every step. For iterative algorithms (machine learning, graph computing), this can yield a 100x+ speedup over MapReduce; even in ordinary batch scenarios, roughly 10x is typical.
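Why iterative workloads benefit so much can be sketched with a toy model that counts simulated dataset reads (plain Python; none of these names are Spark's API):

```python
# Toy sketch of caching for iterative algorithms: without a cache, every
# iteration re-reads (recomputes) the base dataset; with a cache, the
# expensive load happens exactly once. Illustrative only.
reads = {"count": 0}

def load_dataset():
    reads["count"] += 1                # stands in for a disk/HDFS read
    return list(range(1000))

def iterate(n_iters, cached):
    data = load_dataset() if cached else None
    total = 0
    for _ in range(n_iters):
        batch = data if cached else load_dataset()
        total += sum(batch)
    return total

reads["count"] = 0
iterate(10, cached=False)
uncached_reads = reads["count"]

reads["count"] = 0
iterate(10, cached=True)
cached_reads = reads["count"]

print(uncached_reads, cached_reads)  # 10 reads vs 1 read
```

In real Spark the same idea is expressed by persisting an RDD or DataFrame that multiple iterations reuse, so only the first pass pays the load cost.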
Unified Computing Engine
Spark provides a unified programming model: one framework covers multiple scenarios:
- Spark Core: the basic RDD abstraction and task scheduling
- Spark SQL: structured data queries, Hive-compatible
- Spark Streaming / Structured Streaming: unified batch/stream processing
- MLlib: 80+ built-in machine learning algorithms
- GraphX: graph computing framework
This means a team only needs to master one set of APIs, with no need to switch between different frameworks.
Multi-Language Support
Spark officially supports four languages (Scala, Java, Python, and R) and provides a REPL interactive environment, lowering the barrier to entry for data scientists.
Core Concepts Overview
Understanding Spark requires mastering these key terms:
| Concept | Description |
|---|---|
| Application | A user-submitted Spark program |
| Driver Program | The process that runs the main() function and creates the SparkContext |
| Executor | A process on a Worker node that runs tasks |
| Job | A computation triggered by one Action operation |
| Stage | A phase of a Job, divided at Shuffle boundaries |
| Task | The smallest execution unit, assigned to a single partition within a Stage |
One Application contains multiple Jobs, one Job contains multiple Stages, and one Stage contains multiple Tasks; the Task is the smallest unit that actually runs on an Executor.
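The Stage-boundary rule can be sketched as follows (a toy model: the operation lists are illustrative, and real Spark derives stage boundaries from RDD dependencies, not from operation names):

```python
# Hypothetical sketch of splitting a Job into Stages at shuffle boundaries.
# Narrow transformations stay within a stage; a wide (shuffle-inducing)
# transformation closes the current stage. Not Spark's actual scheduler.
NARROW = {"map", "filter", "flatMap"}
WIDE = {"groupByKey", "reduceByKey", "join"}   # these introduce a shuffle

def split_into_stages(ops):
    """Group a linear chain of transformations into stages: a wide
    dependency closes the current stage and starts a new one."""
    stages, current = [], []
    for op in ops:
        current.append(op)
        if op in WIDE:
            stages.append(current)
            current = []
    if current:
        stages.append(current)
    return stages

job = ["map", "filter", "reduceByKey", "map", "groupByKey", "map"]
print(split_into_stages(job))
# shuffle boundaries after reduceByKey and groupByKey -> 3 stages
```

Each resulting stage is then executed as a set of Tasks, one per partition, which is why shuffle-heavy jobs have more scheduling rounds.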
Deployment Modes
Spark supports three main deployment methods:
- Standalone: Spark’s built-in resource management, suitable for independent clusters
- YARN: Integrated with the Hadoop ecosystem, the most common choice in production in China
- Kubernetes: Containerized deployment, cloud-native trend
Driver and Executor relationship: the Driver parses the program, generates the execution plan, and schedules Tasks; Executors receive Tasks, execute the computation, and return results to the Driver.
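This division of labor can be sketched as a toy round-robin scheduler (purely illustrative; the real Spark scheduler also considers data locality, executor resources, and failure recovery):

```python
# Toy sketch of the Driver/Executor split: the "driver" builds one task per
# partition and farms them out to executors; "executors" run the tasks and
# return partial results, which the driver combines. Not Spark internals.
def run_job(partitions, task, n_executors=2):
    # Driver side: assign each partition's task to an executor, round-robin.
    results = []
    for i, part in enumerate(partitions):
        executor_id = i % n_executors
        results.append((executor_id, task(part)))  # executor side: run the task
    # Driver side: collect and combine the partial results.
    return sum(r for _, r in results)

partitions = [[1, 2], [3, 4], [5, 6]]
print(run_job(partitions, task=sum))  # 21
```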
Core Position of RDD
RDD (Resilient Distributed Dataset) is Spark’s most fundamental data abstraction. All advanced APIs (DataFrame, Dataset) are built on top of RDD. RDD has two types of operations:
- Transformation: returns a new RDD and is lazily evaluated, e.g., `map`, `filter`, `groupByKey`
- Action: triggers actual computation and returns a result or writes data, e.g., `collect`, `count`, `saveAsTextFile`
Lazy evaluation is the key to Spark's execution-plan optimization: until an Action is encountered, Spark only records the transformation logic and performs no actual computation, which enables global optimization over the entire DAG.
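The transformation/action split can be sketched with a toy RDD-like class (illustrative only, not Spark's implementation):

```python
# Minimal sketch of lazy evaluation: transformations only record functions
# in a pipeline; an action replays the whole recorded chain over the data.
class ToyRDD:
    def __init__(self, data, pipeline=None):
        self._data = data
        self._pipeline = pipeline or []   # recorded transformations

    def map(self, f):                     # transformation: lazy, returns new "RDD"
        return ToyRDD(self._data, self._pipeline + [("map", f)])

    def filter(self, p):                  # transformation: lazy, returns new "RDD"
        return ToyRDD(self._data, self._pipeline + [("filter", p)])

    def collect(self):                    # action: triggers the actual computation
        out = list(self._data)
        for kind, fn in self._pipeline:
            if kind == "map":
                out = [fn(x) for x in out]
            else:
                out = [x for x in out if fn(x)]
        return out

rdd = ToyRDD(range(5)).map(lambda x: x * 2).filter(lambda x: x > 4)
# Nothing has run yet; collect() executes the recorded chain.
print(rdd.collect())  # [6, 8]
```

Because the whole chain is visible before anything runs, an engine with this design gets the chance to reorder, fuse, or prune steps globally, which is exactly what Spark does over the DAG.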