This is article 70 in the Big Data series, comprehensively explaining Spark RDD’s three creation methods and practical usage of common Transformation operators.
SparkContext: Entry Point Object
SparkContext is the entry point of every Spark application; all RDD creation starts from it. It is responsible for:
- Establishing connection with cluster manager (Standalone/YARN)
- Creating RDDs, broadcast variables, accumulators
- Submitting Jobs and coordinating execution
```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://h121.wzk.icu:7077")
  .setAppName("RDD Demo")
val sc = new SparkContext(conf)
```
Three RDD Creation Methods
Method 1: From Collection (parallelize / makeRDD)
Suitable for testing and small-scale data: parallelize an in-memory Scala collection directly into an RDD:
```scala
// parallelize: the basic method
val rdd1 = sc.parallelize(List(1, 2, 3, 4, 5), numSlices = 3)

// makeRDD: a wrapper around parallelize; also supports specifying
// partition location preferences
val rdd2 = sc.makeRDD(Seq("hello", "world", "spark"))

// Verify the partition count
println(rdd1.getNumPartitions) // 3
```
numSlices controls the partition count; its default is sc.defaultParallelism (usually the total number of CPU cores available to the application).
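The location-preference overload of makeRDD mentioned above pairs each element with a list of preferred hosts; Spark then tries to schedule each partition's task on one of those hosts. A minimal sketch (the hostnames are placeholders, not real nodes):

```scala
// Each element becomes its own partition, tagged with preferred hosts.
val rddWithPrefs = sc.makeRDD(Seq(
  (1, Seq("host-a", "host-b")), // partition 0 prefers host-a / host-b
  (2, Seq("host-c"))            // partition 1 prefers host-c
))
println(rddWithPrefs.getNumPartitions) // 2: one partition per element
```

This overload is mainly useful when the data's physical location is known in advance and you want computation to move to the data.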
Method 2: From File System (textFile)
The most common method in production; it supports multiple storage backends:
```scala
// Read a local file
val localRDD = sc.textFile("file:///opt/data/words.txt")

// Read an HDFS file
val hdfsRDD = sc.textFile("hdfs://h121.wzk.icu:8020/data/input/")

// Read a directory (all files in the directory are read together)
val dirRDD = sc.textFile("hdfs:///data/logs/2024/")

// Specify a minimum partition count
// (the actual count is roughly max(minPartitions, number of HDFS blocks))
val rdd = sc.textFile("hdfs:///data/big-file.csv", minPartitions = 10)
```
By default, textFile reads line by line: each line becomes one string element. To read whole files as (filename, content) pairs instead, use wholeTextFiles.
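A short sketch of wholeTextFiles (the path is a placeholder):

```scala
// wholeTextFiles returns an RDD[(String, String)]:
// key = full file path, value = entire file content.
// Best suited to many small files; each file must fit in memory at once.
val files = sc.wholeTextFiles("hdfs:///data/small-files/")

files.map { case (path, content) =>
  s"$path: ${content.length} chars"
}.take(5).foreach(println)
```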
Method 3: Transform from Existing RDD
Transforming one RDD into another via Transformation operations is the main way to build a computation pipeline (see the next section).
Transformation Operators Details
Transformations are lazy: calling one does not trigger computation, it only records the transformation logic. Only when an Action (collect, count, saveAsTextFile, etc.) is invoked does Spark actually execute the whole DAG.
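Laziness is easy to observe with a side effect; a minimal sketch:

```scala
val nums = sc.parallelize(1 to 3)

// Nothing is printed here: map only records the transformation.
val traced = nums.map { x =>
  println(s"processing $x") // runs on executors, only once an Action fires
  x * 2
}

// collect() is an Action: the map function actually runs now.
traced.collect() // Array(2, 4, 6)
```

In cluster mode the println output lands in the executor logs rather than the driver console, but the timing point stands: nothing runs until the Action.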
Basic Transformations
map — one-to-one mapping, apply function to each element:
```scala
val nums = sc.parallelize(1 to 5)
val doubled = nums.map(_ * 2)
// Result: 2, 4, 6, 8, 10
```
filter — filter elements by condition:
```scala
val evens = nums.filter(_ % 2 == 0)
// Result: 2, 4
```
flatMap — one-to-many expansion, map then flatten:
```scala
val lines = sc.parallelize(Seq("hello world", "spark rdd"))
val words = lines.flatMap(_.split(" "))
// Result: hello, world, spark, rdd
```
Partition-Level Operations
mapPartitions — processes one partition at a time, reducing per-element function-call overhead (useful when each partition needs expensive setup, such as a database connection):
```scala
val result = rdd.mapPartitions { iter =>
  // iter is an iterator over all elements of the current partition
  iter.map(x => process(x)) // process is a placeholder for your own logic
}
```
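The connection-reuse pattern mentioned above might look like the following sketch; DbConnection and its methods are hypothetical stand-ins for a real client library:

```scala
// Hypothetical client: one connection per partition, not per element.
val saved = records.mapPartitions { iter =>
  val conn = DbConnection.open("jdbc:...")            // opened once per partition
  val results = iter.map(r => conn.insert(r)).toList  // materialize before closing
  conn.close()
  results.iterator
}
```

Note the classic pitfall: iter is lazy, so the results must be materialized (toList here) before the connection is closed, otherwise the inserts would run against a closed connection.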
mapPartitionsWithIndex — same as above, additionally passes partition index:
```scala
rdd.mapPartitionsWithIndex { (partIdx, iter) =>
  iter.map(x => s"Partition $partIdx: $x")
}
```
Aggregation and Reorganization
groupBy — group by function result (causes Shuffle):
```scala
val words = sc.parallelize(Seq("apple", "banana", "avocado", "blueberry"))
val grouped = words.groupBy(_.charAt(0))
// Result: ('a', [apple, avocado]), ('b', [banana, blueberry])
```
distinct — deduplication (causes Shuffle):
```scala
val deduped = sc.parallelize(Seq(1, 2, 2, 3, 3, 3)).distinct()
// Result: 1, 2, 3
```
sortBy — sort by specified function (causes Shuffle):
```scala
val sorted = sc.parallelize(Seq(3, 1, 4, 1, 5, 9)).sortBy(x => x)
// Result: 1, 1, 3, 4, 5, 9
```
repartition / coalesce — adjust partition count:
```scala
// repartition: can increase or decrease partitions; always triggers a Shuffle
val rdd10 = rdd.repartition(10)

// coalesce: mainly used to decrease partitions; by default it does not
// trigger a Shuffle (it merges existing partitions instead)
val rdd3 = rdd10.coalesce(3)
```
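The effect of the two calls can be checked with getNumPartitions and the mapPartitionsWithIndex operator shown earlier; a small sketch:

```scala
val data = sc.parallelize(1 to 12, numSlices = 2)

val wide = data.repartition(4)   // Shuffle: data redistributed across 4 partitions
println(wide.getNumPartitions)   // 4

val narrow = wide.coalesce(2)    // no Shuffle: existing partitions merged
println(narrow.getNumPartitions) // 2

// Inspect which elements ended up in which partition
narrow.mapPartitionsWithIndex { (idx, iter) =>
  Iterator(s"partition $idx -> ${iter.mkString(", ")}")
}.collect().foreach(println)
```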
Trigger Execution: Action Operations
Transformations build the computation graph; Actions trigger execution and return results:
```scala
val wordCounts = sc.textFile("hdfs:///input/")
  .flatMap(_.split("\\s+"))
  .map(w => (w, 1))
  .reduceByKey(_ + _) // Transformation; causes a Shuffle

// The calls below are Actions and trigger the actual computation
wordCounts.collect()                         // collect all results to the Driver
wordCounts.count()                           // count the elements
wordCounts.take(10)                          // take the first 10
wordCounts.saveAsTextFile("hdfs:///output/") // write to files
```
Performance Summary
| Operator | Causes Shuffle | Recommended Scenario |
|---|---|---|
| map / filter / flatMap | No (narrow dependency) | Element-level transformation; prefer these |
| mapPartitions | No | Batch processing that reuses connections/resources |
| groupByKey | Yes | Replace with reduceByKey where possible |
| reduceByKey | Yes (with local pre-aggregation) | Preferred for key-value aggregation |
| repartition | Yes | Increase partition count to improve parallelism |
| coalesce | No (by default) | Decrease partition count; avoid small files |
Understanding whether each operator causes a Shuffle is central to Spark performance tuning: wide-dependency (Shuffle) operations should be minimized or replaced with narrow-dependency equivalents.
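The groupByKey → reduceByKey recommendation from the table, sketched on a tiny word count:

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))

// groupByKey: every value crosses the network in the Shuffle, then is summed
val viaGroup = pairs.groupByKey().mapValues(_.sum)

// reduceByKey: values are pre-aggregated within each partition first,
// so far less data crosses the network for the same result
val viaReduce = pairs.reduceByKey(_ + _)

viaReduce.collect() // Array((a,2), (b,1)) -- element order not guaranteed
```

Both produce identical results; the difference is purely in how much data moves during the Shuffle, which dominates at scale.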