This is article 69 in the Big Data series, a deep dive into RDD, Spark’s core data abstraction, covering its five key features and design principles.

What is RDD

RDD (Resilient Distributed Dataset) is the most fundamental data abstraction in the Spark framework: an immutable, partitioned, distributed collection of elements that can be computed in parallel.

RDD’s “resilience” is reflected in four aspects:

  • Storage resilience: Data can automatically spill between memory and disk
  • Fault tolerance: Lost partitions are automatically rebuilt through lineage
  • Compute resilience: Failed tasks are automatically retried
  • Partitioning resilience: Data can be repartitioned as needed

All of Spark’s higher-level APIs (DataFrame, Dataset) are built on top of RDD. Understanding RDD is the foundation for mastering Spark.

Five Key Features

1. Partition List

An RDD consists of multiple Partitions; each partition holds a subset of the dataset and is the basic unit of parallel processing in Spark. The partition count determines the degree of parallelism:

  • When reading from HDFS, by default each Block (128 MB) maps to one partition
  • The count can be adjusted manually via sc.textFile(path, numPartitions) or rdd.repartition(n)
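As a back-of-envelope check, the default partition count roughly follows the block count. The sketch below is a simplified model (assuming exactly one partition per 128 MB block; Spark’s actual split logic also considers minPartitions and record boundaries):

```scala
// Simplified model: one partition per 128 MB HDFS block.
val blockSizeBytes: Long = 128L * 1024 * 1024

// Ceiling division, with at least one partition even for tiny files.
def defaultPartitionCount(fileSizeBytes: Long): Long =
  math.max(1L, (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes)
```

Under this model a 1 GB file yields 8 partitions, so by default 8 tasks can read it in parallel.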

2. Compute Function

Each RDD carries a compute function describing how its data is derived from its parent RDD. For example:

val lines = sc.textFile("hdfs:///data/input")
val words = lines.flatMap(_.split(" "))  // Compute function: split by space
val counts = words.map(w => (w, 1))      // Compute function: construct (word, 1) key-value pairs

Compute functions are lazily evaluated: calling flatMap or map does not execute anything immediately; computation actually runs only when an Action such as collect() or count() is triggered.
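This laziness can be felt in plain Scala. The snippet below is only an analogy, not Spark code: LazyList defers its map just as an RDD defers its transformations until an action forces evaluation.

```scala
// Analogy: LazyList defers work like RDD transformations do.
var evaluations = 0

// Build the "pipeline"; the map closure has not run yet.
val pipeline = LazyList.from(1).take(5).map { x => evaluations += 1; x * 2 }
val before = evaluations   // still 0: nothing computed

// The "action": forcing to a List runs the whole pipeline.
val result = pipeline.toList
val after = evaluations    // now 5: one evaluation per element
```

The same shape holds in Spark: flatMap/map only record the plan, and collect() plays the role of toList here.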

3. Dependencies

RDD records dependencies with parent RDD, divided into two types:

Narrow Dependency: Each partition of the parent RDD is used by at most one partition of the child RDD, so no Shuffle is needed. Typical operators: map, filter, union.

Parent partition 1 → Child partition 1
Parent partition 2 → Child partition 2   (one-to-one, can pipeline)

Wide Dependency (Shuffle Dependency): Each partition of the parent RDD is consumed by multiple partitions of the child RDD, requiring a Shuffle to move data across nodes. Typical operators: groupByKey, reduceByKey, join.

Parent partition 1 → Child partition 1, Child partition 2
Parent partition 2 → Child partition 1, Child partition 2  (many-to-many, needs Shuffle)

Wide dependencies are the Stage boundaries: Spark starts a new Stage at every Shuffle.
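The cutting rule above can be sketched in a few lines. This is a deliberately simplified model of what Spark’s DAGScheduler does (a linear chain of operators instead of a full DAG; the Op type and splitIntoStages are hypothetical names, not Spark API):

```scala
// Each operator is either narrow (pipelines) or wide (shuffles).
final case class Op(name: String, wide: Boolean)

// Walk the chain, starting a fresh stage at every wide dependency.
def splitIntoStages(ops: List[Op]): List[List[String]] =
  ops.foldLeft(List(List.empty[String])) { (stages, op) =>
    if (op.wide) List(op.name) :: stages          // shuffle boundary: new stage
    else (op.name :: stages.head) :: stages.tail  // narrow dep: same stage
  }.map(_.reverse).reverse
```

Applied to the word-count chain, textFile/flatMap/map land in one Stage and reduceByKey opens a second, which matches where Spark draws the boundary.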

4. Partitioner

For key-value pair RDDs, a partitioner can be specified to control how data is distributed across nodes:

  • HashPartitioner (default): Partition by key.hashCode % numPartitions, suitable for most scenarios
  • RangePartitioner: Partition by key range, suitable for scenarios needing ordered output (e.g., sortByKey)

Choosing the right partitioner can eliminate unnecessary Shuffles, an important lever in performance tuning.
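The hash placement rule is small enough to re-implement. The sketch below mirrors what HashPartitioner does, including the non-negative modulo that guards against negative hashCodes (a plain `%` in Scala can return a negative result):

```scala
// Java-style % can be negative; fold it back into [0, mod).
def nonNegativeMod(x: Int, mod: Int): Int = {
  val raw = x % mod
  if (raw < 0) raw + mod else raw
}

// Hash placement: same key always lands in the same partition.
def getPartition(key: Any, numPartitions: Int): Int =
  nonNegativeMod(key.hashCode, numPartitions)
```

The determinism is the point: because equal keys always map to the same partition, an already hash-partitioned RDD can be re-aggregated by key without another Shuffle.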

5. Preferred Locations List

Based on the data-locality principle, an RDD records the preferred compute locations for each partition. The Spark scheduler tries to assign each Task to the node where its data resides, avoiding network transfer:

  • PROCESS_LOCAL: Data in the same JVM process (optimal)
  • NODE_LOCAL: Data on the same node, on disk or in another process
  • RACK_LOCAL: Data on another node in the same rack
  • ANY: Data on a different rack (worst; requires cross-rack transfer)
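The scheduler’s preference among these levels is just a best-to-worst ranking. A minimal sketch (the names mirror Spark’s locality levels, but localityRank and bestLevel are hypothetical helpers, not Spark API):

```scala
// Lower rank = better locality, matching the list above.
val localityRank = Map(
  "PROCESS_LOCAL" -> 0,
  "NODE_LOCAL"    -> 1,
  "RACK_LOCAL"    -> 2,
  "ANY"           -> 3)

// Given the levels currently achievable for a task, pick the best one.
def bestLevel(available: Seq[String]): String = available.minBy(localityRank)
```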

Lazy Evaluation and DAG Optimization

Lazy evaluation lets Spark optimize the entire computation graph before executing it:

  1. The user calls a series of Transformations; Spark only records the logical plan
  2. When an Action is triggered, Spark analyzes the complete DAG
  3. The DAG is divided into Stages at Shuffle boundaries
  4. Operators within the same Stage are merged into one pipelined pass, reducing intermediate disk writes
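Step 4, pipelining, means each record flows through all of a Stage’s operators before the next record starts, instead of materializing an intermediate collection per operator. Scala’s Iterator behaves the same way, which makes for a small demonstration (an analogy only, not Spark code):

```scala
// Record the order in which the two map closures fire.
val trace = scala.collection.mutable.ListBuffer.empty[String]

// Iterator chains are pipelined: each element passes through f then g
// before the next element is touched, like operators within one Stage.
val pipelined = Iterator(1, 2)
  .map { x => trace += s"f($x)"; x + 10 }
  .map { x => trace += s"g($x)"; x * 2 }
  .toList
```

The trace comes out interleaved (f(1), g(11), f(2), g(12)) rather than all f’s followed by all g’s, which is exactly the per-record pipelining that lets Spark skip intermediate writes within a Stage.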

Fault Tolerance: Lineage Reconstruction

RDD achieves fault tolerance without replicating data: it records the complete transformation chain as Lineage. When a partition is lost, Spark re-executes only the relevant transformations from the nearest checkpoint (or the source data), instead of recomputing the entire Job.

textFile → flatMap → map → reduceByKey → collect
    ↑                                        |
    └──── when a partition is lost, rebuild from the source ────┘
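The rebuild amounts to replaying a recorded function chain. A tiny hypothetical model (the Lineage class and its fields are invented for illustration; Spark’s real lineage is the dependency graph between RDDs):

```scala
// A partition's lineage: where the data comes from and how to transform it.
final case class Lineage[A, B](source: () => Seq[A], transform: Seq[A] => Seq[B]) {
  // Recovery: re-read the source and replay the recorded transformations.
  def recompute(): Seq[B] = transform(source())
}

// Word count expressed as a replayable lineage.
val wordCounts = Lineage[String, (String, Int)](
  source = () => Seq("a b", "b c"),
  transform = lines => lines
    .flatMap(_.split(" "))
    .map(w => (w, 1))
    .groupBy(_._1)
    .map { case (w, pairs) => (w, pairs.size) }
    .toSeq
)
```

Because the source and the transformations are both recorded, losing the computed result costs nothing but a replay; no copy of the output had to be kept.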

Typical Use Cases

Scenario                                       | Why RDD is Suitable
Iterative machine learning (K-Means, PageRank) | The same dataset is reused across iterations; cache() keeps it in memory
Complex custom data processing logic           | The low-level API provides complete control
Unstructured data processing                   | Supports arbitrary Scala/Java objects
Multi-stage ETL pipelines                      | Lazy evaluation + DAG optimization reduce intermediate writes
Key-value aggregation statistics               | reduceByKey combines on the map side, making the Shuffle efficient

Mastering RDD’s five key features is a prerequisite for a deep understanding of Spark’s execution model and performance tuning, and the foundation for learning the higher-level DataFrame/Dataset APIs.