This is article 67 in the Big Data series, systematically reviewing three generations of big data computing engine evolution, focusing on Spark’s design philosophy and core components.

Three Generations of Computing Engines

Big data processing technology has gone through three main stages:

| Generation | Representative Tech | Positioning |
| --- | --- | --- |
| First | MapReduce | Batch processing, disk-driven |
| Second | Spark | In-memory computing, unified batch/stream |
| Third | Flink | True stream processing, event-driven |

Spark was born in 2009 at UC Berkeley's AMPLab, became an Apache top-level project in 2014, and currently dominates production environments in China.

MapReduce Limitations

MapReduce pioneered distributed batch processing, but as business complexity grew, its shortcomings became apparent:

  • High disk I/O overhead: Each MapReduce job's intermediate results must be written to HDFS, causing frequent disk reads and writes in multi-step ETL pipelines
  • Limited expressiveness: Only two phases, Map and Reduce; complex logic requires chaining multiple Jobs in series
  • High latency: Each Job must re-apply for resources and start JVMs; minute-level latency cannot meet interactive query needs
  • No stream processing support: An inherently batch-oriented model that cannot process real-time data streams

Spark Core Advantages

In-Memory Computing

Spark's core innovation is keeping intermediate computation results in memory rather than writing them to disk at every step. For iterative algorithms (machine learning, graph computing), Spark can be over 100x faster than MapReduce; even in plain batch-processing scenarios, a roughly 10x speedup is typical.

Unified Computing Engine

Spark provides a unified programming model; one framework covers multiple scenarios:

  • Spark Core: Basic RDD abstraction and task scheduling
  • Spark SQL: Structured data query, Hive compatible
  • Spark Streaming / Structured Streaming: Unified batch/stream processing
  • MLlib: Built-in 80+ machine learning algorithms
  • GraphX: Graph computing framework

This means a team only needs to master one set of APIs, with no need to switch between different frameworks.

Multi-Language Support

Officially supports four languages (Scala, Java, Python, and R) and provides a REPL interactive environment, lowering the barrier to entry for data scientists.

Core Concepts Overview

Understanding Spark requires mastering these key terms:

| Concept | Description |
| --- | --- |
| Application | A user-submitted Spark program |
| Driver Program | The process that runs main() and creates the SparkContext |
| Executor | A process on a Worker node that runs Tasks |
| Job | A computation triggered by one Action operation |
| Stage | A phase of a Job, divided at Shuffle boundaries |
| Task | The smallest execution unit, assigned to a single partition within a Stage |

One Application contains multiple Jobs, one Job contains multiple Stages, and one Stage contains multiple Tasks; the Task is the smallest unit of work actually executed on an Executor.

Deployment Modes

Spark supports three main deployment methods:

  • Standalone: Spark’s built-in resource management, suitable for independent clusters
  • YARN: Integrated with the Hadoop ecosystem, the mainstream choice in Chinese production environments
  • Kubernetes: Containerized deployment, cloud-native trend
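The deployment mode is selected with `spark-submit`'s `--master` flag. The following invocations are hedged examples: the flags are real `spark-submit` options, but the host names, paths, and resource sizes are illustrative placeholders:

```shell
# Standalone cluster (Spark's built-in resource manager)
spark-submit --master spark://master-host:7077 --executor-memory 4g app.py

# YARN, cluster mode (the Driver also runs inside the cluster)
spark-submit --master yarn --deploy-mode cluster --num-executors 10 app.py

# Kubernetes (requires a container image with Spark and the application inside)
spark-submit --master k8s://https://k8s-apiserver:6443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.container.image=my-spark-image:latest app.py
```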

Driver and Executor relationship: the Driver parses the program, generates the execution plan, and schedules Tasks; Executors receive Tasks, execute the computation, and return results to the Driver.

Core Position of RDD

RDD (Resilient Distributed Dataset) is Spark’s most fundamental data abstraction. All advanced APIs (DataFrame, Dataset) are built on top of RDD. RDD has two types of operations:

  • Transformation: Returns new RDD, lazy execution, e.g., map, filter, groupByKey
  • Action: Triggers actual computation, returns result or writes data, e.g., collect, count, saveAsTextFile

Lazy evaluation is the key to Spark's execution-plan optimization: until an Action is encountered, Spark only records the transformation lineage without performing any actual computation, which enables global optimization of the entire DAG.