This is article 79 in the Big Data series, systematically reviewing the Spark Standalone deployment mode: its architecture, the job submission flow, and performance optimization strategies.

The Four Core Components of a Standalone Cluster

Driver

The “brain” of the user program, responsible for:

  • Parsing user code into a DAG
  • Dividing the DAG into Stages and generating the Task list
  • Requesting resources from the Master
  • Monitoring Task execution status and handling failure retries

Master

The cluster resource manager; it listens on port 7077 by default, with the Web UI on port 8080:

  • Maintains a global view of cluster resources
  • Accepts application registration requests and allocates Worker resources according to a scheduling strategy
  • Monitors Worker heartbeats and handles node failures

Worker

Each physical node runs one Worker process:

  • Register with Master at startup and report available CPU cores and memory
  • Start/stop Executor processes based on Master instructions
  • Send heartbeats to Master periodically

Executor

The process that performs the actual computation; it runs on Worker nodes:

  • Maintain thread pool, execute multiple Tasks in parallel
  • Manage RDD data caching on this node (BlockManager)
  • Report Task completion status to Driver

Application Submission Flow

User code (main method)
  ↓
① Initialize SparkContext
  ↓
② The Driver registers with the Master, describing the resources it needs (core count, memory size)
  ↓
③ The Master checks cluster capacity and selects suitable Worker nodes
  ↓
④ The Master notifies the chosen Workers to start Executor processes
  ↓
⑤ Once started, each Executor **reverse-registers** with the Driver
  ↓
⑥ The Driver serializes Tasks and distributes them to the corresponding Executors for execution
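
The flow begins in user code with step ①. Here is a minimal sketch in Scala; the object name, app name, and data are illustrative, and a real standalone Master must be reachable at spark://master:7077 for it to run:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MyApp {
  def main(args: Array[String]): Unit = {
    // ① Build the configuration; the master URL points at the standalone Master (port 7077)
    val conf = new SparkConf()
      .setAppName("MyApp")                 // illustrative application name
      .setMaster("spark://master:7077")
      .set("spark.executor.memory", "4g")  // per-Executor memory, as in the submit example below

    // Creating the SparkContext triggers steps ② through ⑤:
    // registration with the Master, Executor launch, and reverse registration
    val sc = new SparkContext(conf)

    // ⑥ An Action causes the Driver to serialize Tasks and ship them to Executors
    val counts = sc.parallelize(Seq("a", "b", "a"))
      .map(w => (w, 1))
      .reduceByKey(_ + _)
    counts.collect().foreach(println)

    sc.stop()
  }
}
```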

SparkContext Internal Architecture

SparkContext is the core object on the Driver side; it contains three key subsystems:

  • DAGScheduler: splits the RDD DAG into Stages at wide dependencies and submits each Stage to the TaskScheduler
  • TaskScheduler: receives a TaskSet per Stage and distributes Tasks according to available resources and locality preference
  • SchedulerBackend: communicates with Executors, handling registration, status reporting, and resource reclamation

Shuffle Evolution History

Spark’s Shuffle implementation went through three generations:

Hash-Based Shuffle V1 (Early)

  • Each Map Task creates one file per Reduce Task
  • File count = Map Task count × Reduce Task count, which easily produces hundreds of thousands of small files
  • This puts enormous pressure on the disk and file system

Hash-Based Shuffle V2 (File Consolidation)

  • Map Tasks running on the same Executor merge their writes into the same group of files
  • File count drops to Executor count × Reduce Task count
  • Random disk writes remain a problem

Sort-Based Shuffle (Default since Spark 1.2)

  • Each Map Task writes only one data file plus one index file
  • Data is written sorted by partition ID, greatly improving disk I/O efficiency
  • File count drops to Map Task count × 2; this has been the default, and generally best, implementation since Spark 1.2
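
To make the difference concrete, here is a back-of-the-envelope comparison of the three generations' file counts, using illustrative cluster sizes (1,000 Map Tasks, 1,000 Reduce Tasks, 20 Executors):

```scala
// Shuffle intermediate-file counts for the three generations (illustrative numbers).
object ShuffleFileCounts extends App {
  val mapTasks    = 1000
  val reduceTasks = 1000
  val executors   = 20

  val hashV1    = mapTasks * reduceTasks  // 1,000,000 files: one per (Map, Reduce) pair
  val hashV2    = executors * reduceTasks // 20,000 files after consolidation
  val sortBased = mapTasks * 2            // 2,000 files: one data + one index file per Map Task

  println(s"hash V1: $hashV1, hash V2: $hashV2, sort-based: $sortBased")
}
```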

RDD Optimization Strategies

1. Avoid recreating same RDD

Calling textFile on the same data source twice triggers two separate reads; reuse a single RDD reference instead.
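
A sketch of the anti-pattern and the fix, assuming an existing SparkContext sc and a placeholder HDFS path:

```scala
// Inefficient: two textFile calls on the same path read the file twice
val rdd1  = sc.textFile("hdfs:///data/logs.txt")
val words = rdd1.flatMap(_.split(" "))
val rdd2  = sc.textFile("hdfs:///data/logs.txt") // second read of the same data
val lines = rdd2.count()

// Efficient: create the RDD once and reuse the reference
val logs   = sc.textFile("hdfs:///data/logs.txt")
val words2 = logs.flatMap(_.split(" "))
val lines2 = logs.count()
```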

2. Reasonably cache and reuse RDD

An RDD consumed by more than one Action is recomputed from the source each time unless it is cache()d (or persist()ed) beforehand.
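
For example (the path and the two Actions are illustrative; cache() marks the RDD, and its partitions are stored in Executor memory when the first Action materializes them):

```scala
val cleaned = sc.textFile("hdfs:///data/logs.txt") // placeholder path
  .filter(_.nonEmpty)
  .cache() // mark for caching; populated by the first Action

val total  = cleaned.count()  // first Action: reads the file, fills the cache
val sample = cleaned.take(10) // second Action: served from the cache, no re-read
```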

3. Filter early

Apply filter() as early as possible in the transformation chain so that downstream operators process less data.
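
A sketch of the idea; parseLine and the record's status field are hypothetical, standing in for any expensive per-record transformation:

```scala
// Inefficient: the expensive parse runs on every line, including ones we later discard
val slow = sc.textFile("hdfs:///data/logs.txt")
  .map(parseLine)                // hypothetical parse into a structured record
  .filter(_.status == 200)

// Efficient: a cheap string filter first, so parseLine sees far fewer lines
val fast = sc.textFile("hdfs:///data/logs.txt")
  .filter(_.contains(" 200 "))
  .map(parseLine)
```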

4. Choose the right aggregation operator

// Inefficient: groupByKey transfers all values then aggregates, large shuffle data
rdd.groupByKey().mapValues(_.sum)

// Efficient: reduceByKey pre-aggregates on Map side, significantly reduces shuffle data
rdd.reduceByKey(_ + _)

5. Broadcast large variables

Read-only shared variables larger than a few MB (dictionary tables, model parameters) should be distributed as broadcast variables, so each Executor receives one copy instead of one copy per Task.
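
A sketch of the pattern; the dictionary contents and codesRdd (an assumed RDD[String] of country codes) are illustrative:

```scala
// Hypothetical dictionary table; several MB in a real job
val countryNames: Map[String, String] =
  Map("CN" -> "China", "US" -> "United States")

// Without broadcast, the map is serialized into every Task's closure.
// With broadcast, it is shipped once per Executor and shared by its Tasks.
val bc = sc.broadcast(countryNames)

val resolved = codesRdd.map { code =>
  (code, bc.value.getOrElse(code, "unknown"))
}
```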

Submission Parameters Reference

# --deploy-mode client: Driver runs on the submitting machine; cluster: Driver runs inside the cluster
spark-submit \
  --master spark://master:7077 \
  --deploy-mode client \
  --num-executors 4 \
  --executor-cores 2 \
  --executor-memory 4g \
  --driver-memory 2g \
  --class com.example.MyApp \
  myapp.jar