This is article 70 in the Big Data series. It covers Spark RDD's three creation methods and the practical use of common Transformation operators.

SparkContext: Entry Point Object

SparkContext is the entry point of every Spark application; all RDD creation starts from it. It is responsible for:

  • Establishing connection with cluster manager (Standalone/YARN)
  • Creating RDDs, broadcast variables, accumulators
  • Submitting Jobs and coordinating execution

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://h121.wzk.icu:7077")
  .setAppName("RDD Demo")
val sc = new SparkContext(conf)

Three RDD Creation Methods

Method 1: From Collection (parallelize / makeRDD)

Suitable for testing and small-scale data: parallelizes an in-memory Scala collection into an RDD.

// parallelize: basic method
val rdd1 = sc.parallelize(List(1, 2, 3, 4, 5), numSlices = 3)

// makeRDD: wrapper for parallelize, supports specifying partition location preference
val rdd2 = sc.makeRDD(Seq("hello", "world", "spark"))

// Verify partition count
println(rdd1.getNumPartitions)  // 3

The numSlices parameter controls the partition count; it defaults to sc.defaultParallelism (usually the total number of CPU cores available to the application).

Method 2: From File System (textFile)

Most common method in production, supports multiple storage backends:

// Read local file
val localRDD = sc.textFile("file:///opt/data/words.txt")

// Read HDFS file
val hdfsRDD = sc.textFile("hdfs://h121.wzk.icu:8020/data/input/")

// Read directory (automatically merges all files in directory)
val dirRDD = sc.textFile("hdfs:///data/logs/2024/")

// Specify minimum partition count (actual count ≈ max(minPartitions, HDFS block count))
val rdd = sc.textFile("hdfs:///data/big-file.csv", minPartitions = 10)

By default, textFile reads line by line, producing one string element per line. To read whole files as (filename → content) pairs, use wholeTextFiles.
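A minimal sketch of wholeTextFiles, assuming the sc from the setup above; the HDFS path is illustrative:

```scala
// Each element is a (filePath, fileContent) pair, one per file
val files = sc.wholeTextFiles("hdfs:///data/configs/")
files.foreach { case (path, content) =>
  println(s"$path -> ${content.length} chars")
}
```

Because each file becomes a single element, wholeTextFiles is best suited to many small files, not large ones.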

Method 3: Transform from Existing RDD

Transformations derive a new RDD from an existing one. This is the main way to build a computation pipeline (see the next section).
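A minimal sketch of this pattern, assuming the sc from the setup above:

```scala
val base = sc.parallelize(1 to 10)    // Method 1: from a collection
val squared = base.map(x => x * x)    // derived RDD
val bigOnes = squared.filter(_ > 25)  // derived again; base is untouched
```

RDDs are immutable: each transformation returns a new RDD, leaving the parent unchanged.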

Transformation Operators Details

Transformations are lazy: calling one does not trigger computation, it only records the transformation logic. Only when an Action (collect, count, saveAsTextFile, etc.) is invoked does Spark actually execute the entire DAG.
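The laziness can be observed directly; a sketch assuming the sc from the setup above (note that in cluster mode, println output inside map goes to executor logs, not the driver console):

```scala
val rdd = sc.parallelize(1 to 3)
val mapped = rdd.map { x =>
  println(s"processing $x")  // not printed when map is called
  x * 2
}
// Only when the Action runs does "processing ..." appear
mapped.collect()
```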

Basic Transformations

map — one-to-one mapping, apply function to each element:

val nums = sc.parallelize(1 to 5)
val doubled = nums.map(_ * 2)
// Result: 2, 4, 6, 8, 10

filter — filter elements by condition:

val evens = nums.filter(_ % 2 == 0)
// Result: 2, 4

flatMap — one-to-many expansion, map then flatten:

val lines = sc.parallelize(Seq("hello world", "spark rdd"))
val words = lines.flatMap(_.split(" "))
// Result: hello, world, spark, rdd

Partition-Level Operations

mapPartitions — processes one partition at a time, reducing per-element overhead (well suited to scenarios that need a per-partition resource such as a database connection):

val result = rdd.mapPartitions { iter =>
  // iter is an iterator over all elements in the current partition;
  // process is a placeholder for any per-element transformation
  iter.map(x => process(x))
}
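A hedged sketch of the database-connection scenario mentioned above; DriverManager is standard JDBC, but the connection URL and the lookup helper are illustrative placeholders:

```scala
import java.sql.DriverManager

val enriched = rdd.mapPartitions { iter =>
  // One connection per partition instead of one per element
  val conn = DriverManager.getConnection("jdbc:postgresql://db-host/mydb") // illustrative URL
  val result = iter.map { id =>
    lookup(conn, id)  // stands in for any per-record query
  }.toList            // materialize before closing the connection
  conn.close()
  result.iterator
}
```

The toList call matters: iterators are lazy, so without materializing first, the connection would be closed before any element is actually processed.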

mapPartitionsWithIndex — same as above, additionally passes partition index:

rdd.mapPartitionsWithIndex { (partIdx, iter) =>
  iter.map(x => s"Partition $partIdx: $x")
}

Aggregation and Reorganization

groupBy — group by function result (causes Shuffle):

val words = sc.parallelize(Seq("apple", "banana", "avocado", "blueberry"))
val grouped = words.groupBy(_.charAt(0))
// Result: ('a', [apple, avocado]), ('b', [banana, blueberry])

distinct — deduplication (causes Shuffle):

val deduped = sc.parallelize(Seq(1, 2, 2, 3, 3, 3)).distinct()
// Result: 1, 2, 3

sortBy — sort by specified function (causes Shuffle):

val sorted = sc.parallelize(Seq(3, 1, 4, 1, 5, 9)).sortBy(x => x)
// Result: 1, 1, 3, 4, 5, 9

repartition / coalesce — adjust partition count:

// repartition: can increase or decrease partitions, always triggers Shuffle
val rdd10 = rdd.repartition(10)

// coalesce: mainly used to decrease partitions; by default it merges adjacent
// partitions without a Shuffle (pass shuffle = true to force one)
val rdd3 = rdd10.coalesce(3)

Trigger Execution: Action Operations

Transformations build the computation graph; Actions trigger execution and return results:

val wordCounts = sc.textFile("hdfs:///input/")
  .flatMap(_.split("\\s+"))
  .map(w => (w, 1))
  .reduceByKey(_ + _)     // Transformation, causes Shuffle

// Below are Actions, trigger actual computation
wordCounts.collect()           // Collect all results to the Driver
wordCounts.count()             // Count the number of elements
wordCounts.take(10)            // Take the first 10 elements
wordCounts.saveAsTextFile("hdfs:///output/")  // Write to file

Performance Summary

| Operator | Causes Shuffle | Recommended Scenario |
|---|---|---|
| map / filter / flatMap | No (narrow dependency) | Element-level transformation; prefer these |
| mapPartitions | No | Batch processing that reuses connections/resources |
| groupByKey | Yes | Replace with reduceByKey where possible |
| reduceByKey | Yes (with local pre-aggregation) | Preferred for key-value aggregation |
| repartition | Yes | Increase partition count to improve parallelism |
| coalesce | No (by default) | Decrease partition count; avoid small files |
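The groupByKey → reduceByKey replacement can be sketched with word-count-style pairs, assuming the sc from the setup above:

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))

// groupByKey: ships every value across the network, then sums on the reducer
val slow = pairs.groupByKey().mapValues(_.sum)

// reduceByKey: pre-aggregates within each partition before the Shuffle
val fast = pairs.reduceByKey(_ + _)
```

Both produce ("a", 2) and ("b", 1), but reduceByKey moves far less data across the network.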

Understanding whether each operator causes a Shuffle is the core of Spark performance tuning: minimize wide-dependency (Shuffle) operations, or replace them with equivalent narrow-dependency ones where possible.