This is article 77 in the Big Data series. It takes a deep look at Spark's RDD fault-tolerance mechanism, focusing on how Checkpoint works, when to use it, and what to watch out for.
Two Paths of RDD Fault Tolerance
Spark provides basic fault tolerance through Lineage recomputation: when a partition is lost, Spark can trace back along the dependency chain and recompute it. But for RDDs with extremely long dependency chains or extremely expensive computations, recomputation is too costly. This is when Checkpoint is needed to cut the dependency chain and persist the data to reliable storage.
Checkpoint Execution Flow
```scala
// 1. Configure the checkpoint directory (typically on HDFS)
sc.setCheckpointDir("hdfs://namenode:8020/spark-checkpoint")

// 2. Mark the RDD that needs a checkpoint
val rdd = sc.textFile("hdfs://data/input")
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)
rdd.checkpoint()

// 3. Trigger an Action: Spark then submits an extra Job to write the data to HDFS
rdd.count()
```
Checkpoint execution happens in two phases:
- Calling checkpoint() only registers the mark; nothing is written immediately.
- After the first Action completes, Spark submits an independent checkpoint Job that writes the RDD's data to the configured directory.
Therefore, it is recommended to cache() an RDD before checkpointing it, so the original data is not recomputed just to write the checkpoint.
```scala
rdd.cache()
rdd.checkpoint()
rdd.count() // computes once; the checkpoint Job then reads from the cache
```
Checkpoint vs Persist/Cache
| Dimension | persist / cache | checkpoint |
|---|---|---|
| Storage location | Executor memory / local disk | Reliable distributed storage (e.g. HDFS) |
| Lifecycle | Cleaned up automatically when the application ends | Retained until deleted manually |
| Dependency chain | Full lineage is retained | Lineage is truncated; recovery restarts from the checkpointed RDD |
| Node failure | Data is lost if the Executor crashes | Unaffected by single-node failure |
| Typical scenario | Frequently reused intermediate results | Long dependency chains, iterative algorithms |
Applicable Scenarios
1. Iterative Algorithms (PageRank, K-Means, etc.)
Each iteration produces a new RDD, so the dependency chain grows linearly with the iteration count. Checkpoint the core RDD every few rounds to prevent Driver OOM or task-serialization failures caused by an overly long lineage.
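The effect of periodic checkpointing on lineage length can be modeled in plain Scala, with no Spark required. `ModelRdd` below is an illustrative stand-in, not a Spark class: its `lineage` list plays the role of an RDD's dependency chain, and `checkpoint()` truncates it the way a real checkpoint replaces the chain after the next Action.

```scala
// A plain-Scala model of periodic checkpointing in an iterative job.
// `lineage` stands in for an RDD's dependency chain; checkpoint()
// collapses it to a single entry, modeling how a real checkpoint
// truncates lineage after the next Action.
case class ModelRdd(lineage: List[String]) {
  def transform(step: String): ModelRdd = ModelRdd(step :: lineage)
  def checkpoint(): ModelRdd = ModelRdd(List("checkpoint"))
}

def iterate(iterations: Int, interval: Int): ModelRdd = {
  var rdd = ModelRdd(List("input"))
  for (i <- 1 to iterations) {
    rdd = rdd.transform(s"iteration-" + i)
    // Checkpoint every `interval` rounds to keep the chain bounded
    if (i % interval == 0) rdd = rdd.checkpoint()
  }
  rdd
}
```

With an interval of 3, ten iterations leave a lineage of only two entries, whereas without checkpointing the chain would hold all ten transformations plus the input.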
2. Long Lineage Pipeline Jobs
When an ETL pipeline goes through dozens of transformations, a failure at any step forces recomputation from the beginning. Landing key intermediate results on HDFS as checkpoints can significantly shorten failure recovery time.
3. Multi-Job Shared Intermediate Results
After time-consuming preprocessing results are checkpointed, multiple downstream Jobs can read them directly instead of recomputing them.
RDD Partitioning Strategy
The article also covers two commonly used partitioners:
HashPartitioner (default): hashes the key and takes it modulo the partition count; suitable for most scenarios. The same key is guaranteed to land in the same partition.
```scala
rdd.partitionBy(new HashPartitioner(100))
```
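The modulo step needs one correction: Scala's `%` can return a negative result for negative hash codes, so the remainder must be shifted back into range. A plain-Scala sketch of that logic (the helper names here are illustrative, not Spark's internal API):

```scala
// Map a key to a partition the way a hash partitioner does:
// key.hashCode modulo numPartitions, corrected to be non-negative.
def nonNegativeMod(x: Int, mod: Int): Int = {
  val rawMod = x % mod
  rawMod + (if (rawMod < 0) mod else 0) // shift negative remainders into [0, mod)
}

def hashPartition(key: Any, numPartitions: Int): Int =
  nonNegativeMod(key.hashCode, numPartitions)
```

The correction matters because an Int's hashCode is the value itself, so a key like -7 would otherwise map to partition -7.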
RangePartitioner: uses reservoir sampling to estimate the data distribution, then spreads records evenly across partitions by key range. Suitable for scenarios needing ordered output or range queries; sortByKey uses this partitioner internally.
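Once the sampled range bounds are fixed, assigning a key is just a lookup over the bounds array, where each entry is the upper bound of one partition and keys above the last bound fall into the final partition. A minimal plain-Scala sketch of that lookup (a linear scan for clarity; a real implementation would binary-search when there are many partitions):

```scala
// Assign an Int key to the partition whose range contains it.
// rangeBounds holds the upper bound of every partition except the last,
// so rangeBounds of length n defines n + 1 partitions.
def rangePartition(key: Int, rangeBounds: Array[Int]): Int = {
  var partition = 0
  while (partition < rangeBounds.length && key > rangeBounds(partition))
    partition += 1
  partition
}
```

For example, with bounds Array(10, 20), keys up to 10 go to partition 0, keys 11 to 20 go to partition 1, and anything larger goes to partition 2.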
Verify Checkpoint Takes Effect
```scala
println(rdd.isCheckpointed)    // false (before the Action)
rdd.count()
println(rdd.isCheckpointed)    // true (after the Action)
println(rdd.getCheckpointFile) // Some(...) wrapping the HDFS path
println(rdd.toDebugString)     // lineage now starts from the checkpointed RDD
```
Performance Trade-offs
- Advantages: reliable fault tolerance, truncates overly long dependency chains, supports cross-application reuse
- Costs: extra HDFS I/O, more storage space, and the checkpoint Job lengthens total execution time
Production suggestion: enable checkpointing for pipelines running over 1 hour and for iterative algorithms, and clean the checkpoint directory regularly to avoid storage bloat.