This is article 79 in the Big Data series, systematically reviewing Spark Standalone deployment mode architecture design, job submission flow and performance optimization strategies.
Four Core Components of a Standalone Cluster
Driver
The "brain" of the user program, responsible for:
- Parse user code into a DAG
- Divide the DAG into Stages and generate the Task list
- Request resources from the Master
- Monitor Task execution status and handle failure retries
Master
The cluster resource manager; listens on port 7077 by default, with the Web UI on port 8080:
- Maintain global resource view of cluster
- Accept application registration requests, allocate Worker resources by strategy
- Continuously monitor Worker heartbeats, handle node failures
Worker
Each physical node typically runs one Worker process:
- Register with Master at startup and report available CPU cores and memory
- Start/stop Executor processes based on Master instructions
- Send heartbeats to Master periodically
Executor
The actual computation process; runs on Worker nodes:
- Maintain thread pool, execute multiple Tasks in parallel
- Manage RDD data caching on this node (BlockManager)
- Report Task completion status to Driver
Application Submission Flow
```
User code (main method)
│
▼
① Initialize SparkContext
│
▼
② Driver registers with Master, describing required resources (core count, memory size)
│
▼
③ Master validates cluster capacity, selects suitable Worker nodes
│
▼
④ Master notifies Workers to start Executor processes
│
▼
⑤ After each Executor starts, it reverse-registers with the Driver
│
▼
⑥ Driver serializes Tasks and distributes them to the corresponding Executors for execution
```
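On the application side, steps ① and ② amount to constructing a SparkContext pointed at the Master; the registration happens inside the constructor. A minimal sketch (the master host, app name, and resource values are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MyApp {
  def main(args: Array[String]): Unit = {
    // Step ②: registration with the Master happens when the
    // SparkContext is constructed from this configuration.
    val conf = new SparkConf()
      .setMaster("spark://master:7077")   // placeholder Master URL
      .setAppName("my-app")
      .set("spark.executor.memory", "4g") // resources requested from the Master
      .set("spark.executor.cores", "2")
    val sc = new SparkContext(conf)
    try {
      // Transformations and actions here trigger steps ③–⑥.
    } finally {
      sc.stop()
    }
  }
}
```

This sketch assumes a running Standalone cluster reachable at the given URL; it is not runnable on its own.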
SparkContext Internal Architecture
SparkContext is the core object on Driver side, contains three key subsystems:
| Component | Responsibility |
|---|---|
| DAGScheduler | Split the RDD DAG into Stages at wide-dependency (shuffle) boundaries, submit each Stage to the TaskScheduler as a TaskSet |
| TaskScheduler | Receive TaskSet from Stage, distribute Tasks by resource and locality priority |
| SchedulerBackend | Communicate with Executors, handle registration, status reporting and resource recovery |
Shuffle Evolution History
Spark’s Shuffle implementation went through three generations:
Hash-Based Shuffle V1 (Early)
- Each Map Task creates a file for each Reduce Task
- File count = Map Task count × Reduce Task count, easily generates hundreds of thousands of small files
- Huge pressure on disk and file system
Hash-Based Shuffle V2 (File Consolidation)
- Map Tasks running in the same core slot on an Executor append to the same group of files
- File count reduced to CPU core count × Reduce Task count
- Still has random write disk problem
Sort-Based Shuffle (Default since Spark 1.2)
- Each Map Task only writes one data file + one index file
- Data written in order sorted by partition ID, disk I/O efficiency greatly improved
- File count reduced to Map Task count × 2, the best trade-off of the three generations
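The file-count formulas above can be compared with a quick back-of-the-envelope calculation (the cluster numbers are made up for illustration, and the V2 formula follows this article's description):

```scala
// Hypothetical job: 1,000 Map Tasks, 1,000 Reduce Tasks, 40 total CPU cores.
val mapTasks    = 1000
val reduceTasks = 1000
val totalCores  = 40

val hashV1    = mapTasks * reduceTasks   // one file per (map, reduce) pair
val hashV2    = totalCores * reduceTasks // one file group per core slot
val sortBased = mapTasks * 2             // one data file + one index file per Map Task

println(s"Hash V1:    $hashV1 files")    // 1000000
println(s"Hash V2:    $hashV2 files")    // 40000
println(s"Sort-based: $sortBased files") // 2000
```

Even on this modest job, sort-based shuffle cuts the file count by three orders of magnitude versus V1.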
RDD Optimization Strategies
1. Avoid recreating same RDD
Calling textFile on the same data source twice triggers two full reads; reuse a single RDD reference instead.
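A sketch of the anti-pattern and the fix, assuming an existing SparkContext `sc` (the path is a placeholder):

```scala
// Anti-pattern: two textFile calls on the same path → two full reads
val a = sc.textFile("hdfs:///data/logs").map(_.length)
val b = sc.textFile("hdfs:///data/logs").filter(_.nonEmpty)

// Better: create the RDD once and derive both results from the same reference
val logs     = sc.textFile("hdfs:///data/logs")
val lengths  = logs.map(_.length)
val nonEmpty = logs.filter(_.nonEmpty)
```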
2. Reasonably cache and reuse RDD
An RDD consumed by multiple actions (e.g. repeated collect calls) should be cache()d ahead of time so its lineage is not recomputed for every action.
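A sketch assuming an existing SparkContext `sc` (`parse` and the path are placeholders):

```scala
val parsed = sc.textFile("hdfs:///data/logs").map(parse) // parse is a placeholder
parsed.cache() // or parsed.persist(StorageLevel.MEMORY_AND_DISK)

val total  = parsed.count()  // first action: computes the RDD and populates the cache
val sample = parsed.take(10) // second action: served from the cache, no recomputation
```

Remember to unpersist() the RDD once it is no longer needed, so the Executors can reclaim the memory.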
3. Filter early
Apply filter() as early as possible in the transformation chain to shrink the data volume that downstream operators (and any shuffle) must process.
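For example, assuming an existing SparkContext `sc` (the path and `expensiveParse` are placeholders):

```scala
// Pushing filter() ahead of the costly map() means every downstream
// operator only touches the rows that survive the filter.
val errorCount = sc.textFile("hdfs:///data/logs")
  .filter(_.contains("ERROR")) // drop irrelevant rows first
  .map(expensiveParse)         // placeholder for costly per-record work
  .count()
```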
4. Choose correct aggregation operator
```scala
// Inefficient: groupByKey ships every value across the network, then aggregates
rdd.groupByKey().mapValues(_.sum)

// Efficient: reduceByKey pre-aggregates on the Map side, shrinking shuffle data
rdd.reduceByKey(_ + _)
```
5. Broadcast large variables
Shared variables exceeding several MB (dictionary tables, model parameters) should use broadcast variables to avoid repeated transmission with each Task.
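A sketch assuming an existing SparkContext `sc` and an RDD `rdd` of keys (`loadDict` is a hypothetical loader):

```scala
// dict stands in for a lookup table of several MB (dictionary, model parameters).
val dict: Map[String, String] = loadDict() // hypothetical loader
val bc = sc.broadcast(dict)                // shipped once per Executor, not once per Task

val enriched = rdd.map { key =>
  bc.value.getOrElse(key, "unknown")       // read-only access on the Executors
}
```

Without the broadcast, closing over `dict` directly would serialize it into every Task.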
Submission Parameters Reference
```shell
# deploy-mode client: Driver runs on the submitting machine;
# deploy-mode cluster: Driver runs on a Worker inside the cluster.
# Note: --num-executors applies to YARN/Kubernetes; on Standalone,
# cap resources with --total-executor-cores instead.
spark-submit \
  --master spark://master:7077 \
  --deploy-mode client \
  --total-executor-cores 8 \
  --executor-cores 2 \
  --executor-memory 4g \
  --driver-memory 2g \
  --class com.example.MyApp \
  myapp.jar
```