This is article 81 in the Big Data series. It introduces the features, use cases, and mutual conversions of Spark’s three core data abstractions: RDD, DataFrame, and Dataset.

SparkSession: Unified Entry Point

Before Spark 2.0, developers had to juggle multiple entry points:

  • SparkContext: Entry point for RDD operations
  • SQLContext: Entry point for DataFrame creation and SQL execution
  • HiveContext: Inherits from SQLContext, adds support for HiveQL syntax

Since Spark 2.0, SparkSession unifies these entry points while still holding a reference to the underlying SparkContext:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Demo")
  .master("local[*]")
  .getOrCreate()

// Get the SparkContext through the SparkSession
val sc = spark.sparkContext

RDD (Resilient Distributed Dataset)

RDD is Spark’s lowest-level data abstraction, with the following core features:

Immutability

Operations like map() and filter() generate new RDDs instead of modifying the original data, which simplifies fault tolerance and enables safe parallel processing.
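A minimal sketch of this behavior, reusing the `sc` from the SparkSession example above:

```scala
// Immutability: map() returns a new RDD; the source RDD is untouched.
val nums = sc.parallelize(Seq(1, 2, 3))
val doubled = nums.map(_ * 2)   // a new RDD with lineage nums -> doubled
// nums still contains 1, 2, 3; doubled contains 2, 4, 6
```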

Fault Tolerance

  • Lineage tracking: Records the sequence of transformations, so lost data can be recomputed from ancestor RDDs
  • Checkpointing: Persists an RDD to stable storage and truncates the lineage chain
  • Partition-level recovery: Data is distributed across cluster nodes, so only lost partitions need to be recomputed
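Checkpointing can be sketched as follows; the directory is a placeholder (in production it would typically point at HDFS):

```scala
// Truncate a long lineage chain by persisting to stable storage.
sc.setCheckpointDir("/tmp/spark-checkpoints")   // placeholder directory

val base = sc.parallelize(1 to 1000)
val derived = base.map(_ + 1).filter(_ % 2 == 0)

derived.checkpoint()      // only marked here; written on the next action
val n = derived.count()   // triggers the job and materializes the checkpoint
```

After the checkpoint is written, recovery of `derived` no longer needs to replay the `map`/`filter` lineage from `base`.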

Lazy Evaluation

  • Transformations (e.g. map, filter, reduceByKey): Only record the computation logic; they don’t trigger execution
  • Actions (e.g. count, collect, saveAsTextFile): Trigger the actual computation
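The distinction can be seen in a short word-count sketch: nothing executes until the final action.

```scala
val words = sc.parallelize(Seq("spark", "rdd", "spark"))
val pairs = words.map(w => (w, 1))      // transformation: recorded only
val counts = pairs.reduceByKey(_ + _)   // transformation: still nothing runs
val result = counts.collectAsMap()      // action: executes the whole pipeline
```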

Creation Methods

// From an in-memory collection
val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5))

// From external storage
val fileRdd = sc.textFile("hdfs://path/to/file")

// By transforming an existing RDD
val newRdd = rdd.map(x => x * 2)

Use Cases

  • Unstructured data processing
  • Iterative machine learning algorithms
  • ETL pipelines, graph computing (GraphX)

DataFrame

A DataFrame is a distributed dataset organized into named columns, similar to a two-dimensional table in a relational database, and it supports many data sources (JSON, CSV, Parquet, JDBC, etc.).

Catalyst Optimizer

DataFrame’s biggest advantage is automatic query optimization. The Catalyst optimizer performs optimizations such as:

  • Predicate Pushdown
  • Column Pruning
  • Constant Folding
  • Join Reordering
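The optimizer’s work can be inspected with `explain()`. A small sketch using an in-memory DataFrame (the column names are illustrative):

```scala
import spark.implicits._
import org.apache.spark.sql.functions.col

val df = Seq(("Alice", 6000), ("Bob", 4000)).toDF("name", "salary")
val q = df.filter(col("salary") > 5000).select("name")

q.explain(true)   // prints parsed, analyzed, optimized, and physical plans
```

In the optimized plan the projection is pruned to the columns actually needed and the filter sits directly above the scan; against Parquet or JDBC sources both would be pushed into the data source itself.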

API Examples

// Select and filter with column expressions
df.select("name", "salary").filter(df("salary") > 5000)

// Group and aggregate
df.groupBy("department").agg(Map("salary" -> "avg"))

Notes

DataFrame offers only runtime type checking and lacks compile-time type safety. In fact, a DataFrame is simply a Dataset[Row].
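The difference shows up with a simple column typo. A sketch, assuming a DataFrame with `name` and `salary` columns and the hypothetical case class `Emp`:

```scala
import spark.implicits._
case class Emp(name: String, salary: Long)

val df = Seq(("Alice", 6000L), ("Bob", 4000L)).toDF("name", "salary")
// df.select("salry")   // compiles fine, throws AnalysisException at runtime

val ds = df.as[Emp]     // Encoder[Emp] comes from spark.implicits._
// ds.map(_.salry)      // rejected by the compiler: no such field on Emp
val doubledPay = ds.map(e => e.salary * 2)   // checked at compile time
```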


Dataset

Dataset, introduced in Spark 1.6, combines the RDD’s strong typing with the DataFrame’s optimization capabilities.

Core Advantages

  • Compile-time type safety: Type errors exposed at compile time, not runtime
  • Optimized execution: Reuses Catalyst and Tungsten engines
  • Unified API: Supports both RDD-style map/filter and SQL-style operations
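The unified API can be sketched on a small typed Dataset (the `User` case class is illustrative):

```scala
import spark.implicits._
case class User(name: String, age: Int)

val users = Seq(User("Jack", 28), User("Tom", 10)).toDS()

// RDD-style, strongly typed operations
val adults = users.filter(_.age >= 18).map(_.name)

// SQL-style operations on the same Dataset, optimized by Catalyst
val byName = users.groupBy("name").count()
```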

Creation Examples

import spark.implicits._
import org.apache.spark.sql.functions.desc

// Create from a range
val numDS = spark.range(5, 100, 5)   // Dataset[Long] with a single "id" column
numDS.orderBy(desc("id")).show(5)

// Create from a collection of case class instances
case class Person(name: String, age: Int, height: Int)
val seq = Seq(Person("Jack", 28, 184), Person("Tom", 10, 144))
val ds = spark.createDataset(seq)
ds.printSchema()

A DataFrame is just a special form of Dataset, namely Dataset[Row]; the two should not be treated as separate abstractions.


Mutual Conversions

| Source | Target | Method |
| --- | --- | --- |
| RDD | DataFrame | toDF() or createDataFrame(rdd, schema) |
| DataFrame | RDD | .rdd property |
| DataFrame | Dataset | .as[T] (T is a case class) |
| Dataset | DataFrame | .toDF() |
| RDD | Dataset | .toDS() (element type must match) |
| Dataset | RDD | .rdd property |

RDD to DataFrame Example

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

val arr = Array(("Alice", 30, 170), ("Bob", 25, 175))
val rdd = sc.makeRDD(arr).map(f => Row(f._1, f._2, f._3))

val schema = StructType(Seq(
  StructField("name", StringType, false),
  StructField("age", IntegerType, false),
  StructField("height", IntegerType, false)
))

val df = spark.createDataFrame(rdd, schema)
df.show()
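The remaining conversions from the table can be sketched against the `df` just built; the matching case class and the extra `Person("Carol", ...)` row are illustrative:

```scala
import spark.implicits._
case class Person(name: String, age: Int, height: Int)

val ds = df.as[Person]    // DataFrame -> Dataset[Person]
val back = ds.toDF()      // Dataset -> DataFrame
val typedRdd = ds.rdd     // Dataset -> RDD[Person]
val ds2 = sc.makeRDD(Seq(Person("Carol", 22, 165))).toDS()  // RDD -> Dataset
```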

Side-by-Side Comparison

| Dimension | RDD | DataFrame | Dataset |
| --- | --- | --- | --- |
| Type safety | Compile-time | Runtime | Compile-time |
| Query optimization | None | Catalyst | Catalyst |
| Serialization efficiency | Java/Kryo | Tungsten | Tungsten |
| Use cases | Unstructured data, graph computing | Structured queries, ETL | Type-sensitive structured processing |

  • RDD: Suitable for unstructured data and iterative computation, but receives no automatic query optimization
  • DataFrame: Best performance for structured queries, but sacrifices compile-time type safety
  • Dataset: Compile-time type checking plus runtime query optimization, the best choice for type-aware development