This is article 67 in the Big Data series. It systematically reviews the evolution of three generations of big data computing engines, with a focus on Spark's design philosophy and core components.
Three Generations of Computing Engines
Big data processing technology has gone through three main stages:
| Generation | Representative Tech | Positioning |
|---|---|---|
| First | MapReduce | Batch processing, disk-driven |
| Second | Spark | In-memory computing, unified batch/stream |
| Third | Flink | True stream processing, event-driven |
Spark was born in 2009 at UC Berkeley's AMPLab, became an Apache top-level project in 2014, and currently dominates production environments in China.
MapReduce Limitations
MapReduce pioneered distributed batch processing, but as business complexity grew, its shortcomings gradually emerged:
- High disk I/O overhead: each MapReduce job's intermediate results must be written to HDFS, so multi-step ETL pipelines incur frequent disk reads and writes
- Limited expressiveness: only two phases, Map and Reduce; complex logic requires multiple Jobs chained in series
- High latency: every Job must re-apply for resources and start JVMs, and the resulting minute-level latency cannot meet interactive query needs
- No stream processing support: the model is inherently batch-oriented and cannot process real-time data streams
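The chaining cost in the first point can be sketched in a few lines of plain Python (a toy model: the `mr_job` helper and the dict standing in for HDFS are illustrative, not Hadoop's API):

```python
# Toy sketch of multi-job chaining in MapReduce: each "job" must write its
# output to a simulated HDFS before the next job can read it, so every
# extra step adds a full read/write round trip. Illustrative only.
hdfs = {}                              # stands in for the distributed filesystem
io_ops = {"reads": 0, "writes": 0}

def mr_job(read_path, write_path, mapper, reducer):
    io_ops["reads"] += 1               # job input: read from "HDFS"
    data = hdfs[read_path]
    mapped = [mapper(x) for x in data]
    result = reducer(mapped)
    io_ops["writes"] += 1              # job output: materialize to "HDFS"
    hdfs[write_path] = result
    return result

hdfs["/input"] = [1, 2, 3, 4]
# Step 1 squares each value; step 2 keeps the even ones.
# Each step is a separate job with its own disk round trip.
mr_job("/input", "/tmp/step1", mapper=lambda x: x * x, reducer=lambda xs: xs)
mr_job("/tmp/step1", "/output", mapper=lambda x: x,
       reducer=lambda xs: [x for x in xs if x % 2 == 0])
print(hdfs["/output"], io_ops)  # [4, 16], with 2 reads and 2 writes
```

A ten-step pipeline would pay this materialization cost ten times, which is exactly what Spark's in-memory model avoids.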
Spark Core Advantages
In-Memory Computing
Spark's core innovation is keeping intermediate computation results in memory rather than writing them to disk at every step. For iterative algorithms (machine learning, graph computing), this can yield a 100x+ speedup over MapReduce; even in ordinary batch scenarios, roughly 10x is typical.
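Why iterative workloads benefit so much can be sketched with a toy model that counts simulated dataset reads (plain Python; none of these names are Spark's API):

```python
# Toy sketch of caching for iterative algorithms: without a cache, every
# iteration re-reads (recomputes) the base dataset; with a cache, the
# expensive load happens exactly once. Illustrative only.
reads = {"count": 0}

def load_dataset():
    reads["count"] += 1                # stands in for a disk/HDFS read
    return list(range(1000))

def iterate(n_iters, cached):
    data = load_dataset() if cached else None
    total = 0
    for _ in range(n_iters):
        batch = data if cached else load_dataset()
        total += sum(batch)
    return total

reads["count"] = 0
iterate(10, cached=False)
uncached_reads = reads["count"]

reads["count"] = 0
iterate(10, cached=True)
cached_reads = reads["count"]

print(uncached_reads, cached_reads)  # 10 reads vs 1 read
```

In real Spark the same idea is expressed by persisting an RDD or DataFrame that multiple iterations reuse, so only the first pass pays the load cost.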
Unified Computing Engine
Spark provides a unified programming model: one framework covers multiple scenarios:
- Spark Core: the basic RDD abstraction and task scheduling
- Spark SQL: structured data queries, Hive-compatible
- Spark Streaming / Structured Streaming: unified batch/stream processing
- MLlib: 80+ built-in machine learning algorithms
- GraphX: graph computing framework
This means a team only needs to master one set of APIs, with no need to switch between different frameworks.
Multi-Language Support
Spark officially supports four languages (Scala, Java, Python, and R) and provides a REPL interactive environment, lowering the barrier to entry for data scientists.
Core Concepts Overview
Understanding Spark requires mastering these key terms:
| Concept | Description |
|---|---|
| Application | A user-submitted Spark program |
| Driver Program | The process that runs the main() function and creates the SparkContext |
| Executor | A process on a Worker node that runs tasks |
| Job | A computation triggered by one Action operation |
| Stage | A phase of a Job, divided at Shuffle boundaries |
| Task | The smallest execution unit, assigned to a single partition within a Stage |
One Application contains multiple Jobs, one Job contains multiple Stages, and one Stage contains multiple Tasks; the Task is the smallest unit that actually runs on an Executor.
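The Stage-boundary rule can be sketched as follows (a toy model: the operation lists are illustrative, and real Spark derives stage boundaries from RDD dependencies, not from operation names):

```python
# Hypothetical sketch of splitting a Job into Stages at shuffle boundaries.
# Narrow transformations stay within a stage; a wide (shuffle-inducing)
# transformation closes the current stage. Not Spark's actual scheduler.
NARROW = {"map", "filter", "flatMap"}
WIDE = {"groupByKey", "reduceByKey", "join"}   # these introduce a shuffle

def split_into_stages(ops):
    """Group a linear chain of transformations into stages: a wide
    dependency closes the current stage and starts a new one."""
    stages, current = [], []
    for op in ops:
        current.append(op)
        if op in WIDE:
            stages.append(current)
            current = []
    if current:
        stages.append(current)
    return stages

job = ["map", "filter", "reduceByKey", "map", "groupByKey", "map"]
print(split_into_stages(job))
# shuffle boundaries after reduceByKey and groupByKey -> 3 stages
```

Each resulting stage is then executed as a set of Tasks, one per partition, which is why shuffle-heavy jobs have more scheduling rounds.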
Deployment Modes
Spark supports three main deployment methods:
- Standalone: Spark’s built-in resource management, suitable for independent clusters
- YARN: Integrated with the Hadoop ecosystem, the most common choice in production in China
- Kubernetes: Containerized deployment, cloud-native trend
Driver and Executor relationship: the Driver parses the program, generates the execution plan, and schedules Tasks; Executors receive Tasks, execute the computation, and return results to the Driver.
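This division of labor can be sketched as a toy round-robin scheduler (purely illustrative; the real Spark scheduler also considers data locality, executor resources, and failure recovery):

```python
# Toy sketch of the Driver/Executor split: the "driver" builds one task per
# partition and farms them out to executors; "executors" run the tasks and
# return partial results, which the driver combines. Not Spark internals.
def run_job(partitions, task, n_executors=2):
    # Driver side: assign each partition's task to an executor, round-robin.
    results = []
    for i, part in enumerate(partitions):
        executor_id = i % n_executors
        results.append((executor_id, task(part)))  # executor side: run the task
    # Driver side: collect and combine the partial results.
    return sum(r for _, r in results)

partitions = [[1, 2], [3, 4], [5, 6]]
print(run_job(partitions, task=sum))  # 21
```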
Core Position of RDD
RDD (Resilient Distributed Dataset) is Spark’s most fundamental data abstraction. All advanced APIs (DataFrame, Dataset) are built on top of RDD. RDD has two types of operations:
- Transformation: returns a new RDD and is lazily evaluated, e.g., `map`, `filter`, `groupByKey`
- Action: triggers actual computation and returns a result or writes data, e.g., `collect`, `count`, `saveAsTextFile`
Lazy evaluation is the key to Spark's execution-plan optimization: until an Action is encountered, Spark only records the transformation logic and performs no actual computation, which enables global optimization over the entire DAG.
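The transformation/action split can be sketched with a toy RDD-like class (illustrative only, not Spark's implementation):

```python
# Minimal sketch of lazy evaluation: transformations only record functions
# in a pipeline; an action replays the whole recorded chain over the data.
class ToyRDD:
    def __init__(self, data, pipeline=None):
        self._data = data
        self._pipeline = pipeline or []   # recorded transformations

    def map(self, f):                     # transformation: lazy, returns new "RDD"
        return ToyRDD(self._data, self._pipeline + [("map", f)])

    def filter(self, p):                  # transformation: lazy, returns new "RDD"
        return ToyRDD(self._data, self._pipeline + [("filter", p)])

    def collect(self):                    # action: triggers the actual computation
        out = list(self._data)
        for kind, fn in self._pipeline:
            if kind == "map":
                out = [fn(x) for x in out]
            else:
                out = [x for x in out if fn(x)]
        return out

rdd = ToyRDD(range(5)).map(lambda x: x * 2).filter(lambda x: x > 4)
# Nothing has run yet; collect() executes the recorded chain.
print(rdd.collect())  # [6, 8]
```

Because the whole chain is visible before anything runs, an engine with this design gets the chance to reorder, fuse, or prune steps globally, which is exactly what Spark does over the DAG.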