This is article 81 in the Big Data series, introducing the features, use cases, and mutual conversions of Spark’s three core data abstractions.
SparkSession: Unified Entry Point
Before Spark 2.0, developers needed to use multiple entry points:
- SQLContext: entry point for DataFrame creation and SQL execution
- HiveContext: inherits from SQLContext and adds support for HiveQL syntax
Since Spark 2.0, SparkSession unifies these entry points while holding a reference to the underlying SparkContext:
val spark = SparkSession.builder()
.appName("Demo")
.master("local[*]")
.getOrCreate()
// Get SparkContext through SparkSession
val sc = spark.sparkContext
RDD (Resilient Distributed Dataset)
RDD is Spark’s lowest-level data abstraction, with the following core features:
Immutability
Operations like map(), filter() generate new RDDs instead of modifying original data, which simplifies fault tolerance and enables safe parallel processing.
Fault Tolerance
- Lineage tracking: records the transformation sequence, so lost data can be recomputed from ancestor RDDs
- Checkpointing: persists an RDD to stable storage and truncates the lineage chain
- Partition-level recovery: data is distributed across cluster nodes, so only lost partitions need to be recomputed
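As a minimal sketch of the checkpoint mechanism (the directory path is a placeholder):

```scala
// Checkpointing persists the RDD and truncates its lineage chain
sc.setCheckpointDir("hdfs://path/to/checkpoints")   // placeholder path

val derived = sc.parallelize(1 to 100).map(_ * 2).filter(_ > 10)
derived.checkpoint()   // only marks the RDD for checkpointing
derived.count()        // the action triggers computation and writes the checkpoint
```

After the checkpoint is materialized, recovery reads from stable storage instead of replaying the full lineage.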
Lazy Evaluation
- Transformations (like map, filter, reduceByKey): only record logic, don’t trigger computation
- Actions (like count, collect, saveAsTextFile): trigger actual computation
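Lazy evaluation can be seen in a classic word count (the file path is a placeholder):

```scala
// Transformations only build the lineage; nothing executes yet
val counts = sc.textFile("hdfs://path/to/file")   // placeholder path
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)

// The action triggers the whole pipeline at once
val distinctWords = counts.count()
```

Because the plan is known before execution, Spark can pipeline the transformations within a stage.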
Creation Methods
// From collection
val rdd = sc.parallelize(Array(1, 2, 3, 4, 5))
// From external storage
val rdd = sc.textFile("hdfs://path/to/file")
// Transform from existing RDD
val newRdd = rdd.map(x => x * 2)
Use Cases
- Unstructured data processing
- Iterative machine learning algorithms
- ETL pipelines, graph computing (GraphX)
DataFrame
DataFrame is a distributed dataset organized into named columns, similar to a two-dimensional table in a relational database, and supports multiple data sources (JSON, CSV, Parquet, JDBC, etc.).
Catalyst Optimizer
DataFrame’s biggest advantage is automatic query optimization. The Catalyst optimizer performs:
- Predicate Pushdown
- Column Pruning
- Constant Folding
- Join Reordering
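These optimizations can be observed directly via explain(), assuming df is a DataFrame with name and salary columns as in the API examples that follow:

```scala
// explain(true) prints the parsed, analyzed, and optimized logical plans
// plus the physical plan chosen by Catalyst
df.select("name", "salary")
  .filter(df("salary") > 5000)
  .explain(true)
```

In the optimized plan you would typically see the filter evaluated close to the data source (predicate pushdown) and only the referenced columns read (column pruning).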
API Examples
// SQL-like column operations
df.select("name", "salary").filter(df("salary") > 5000)
// DSL operations
df.groupBy("department").agg(Map("salary" -> "avg"))
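The same aggregation can also be written as literal SQL by registering a temporary view (the view name employees is illustrative):

```scala
// Register the DataFrame as a view, then query it with plain SQL
df.createOrReplaceTempView("employees")
spark.sql("SELECT department, avg(salary) FROM employees GROUP BY department").show()
```

Both styles compile down to the same Catalyst logical plan, so there is no performance difference between them.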
Notes
DataFrame provides only runtime type checking and lacks compile-time type safety. Under the hood, DataFrame is simply an alias for Dataset[Row].
Dataset
Dataset, introduced in Spark 1.6, combines RDD’s strong typing with DataFrame’s optimization capabilities.
Core Advantages
- Compile-time type safety: Type errors exposed at compile time, not runtime
- Optimized execution: Reuses Catalyst and Tungsten engines
- Unified API: supports both RDD-style map/filter and SQL-style operations
Creation Examples
// Create from Range (desc comes from org.apache.spark.sql.functions)
import org.apache.spark.sql.functions.desc
val numDS = spark.range(5, 100, 5)
numDS.orderBy(desc("id")).show(5)
// Create from case class collection
case class Person(name: String, age: Int, height: Int)
val seq = Seq(Person("Jack", 28, 184), Person("Tom", 10, 144))
val ds = spark.createDataset(seq)
ds.printSchema()
DataFrame is simply a special form of Dataset, namely Dataset[Row]; don’t treat them as fundamentally separate abstractions.
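Typed operations on the Person Dataset above illustrate the compile-time safety (a sketch reusing the case class from the creation example):

```scala
// The lambda parameter is statically typed as Person,
// so field access is checked by the compiler
val adultNames = ds
  .filter(p => p.age >= 18)          // Dataset[Person]
  .map(p => p.name.toUpperCase)      // Dataset[String]
adultNames.show()

// A misspelled field such as p.agee fails at compile time,
// whereas the untyped df.filter(df("agee") > 18) only fails at runtime
```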
Mutual Conversions
| Source | Target | Method |
|---|---|---|
| RDD | DataFrame | toDF() or createDataFrame(rdd, schema) |
| DataFrame | RDD | .rdd property |
| DataFrame | Dataset | .as[T] (T is case class) |
| Dataset | DataFrame | .toDF() |
| RDD | Dataset | .toDS() (needs type match) |
| Dataset | RDD | .rdd property |
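The table’s conversions require the session’s implicits in scope; a combined sketch (the Person case class is illustrative):

```scala
// toDF / toDS / as[T] all come from the session's implicit conversions
import spark.implicits._

case class Person(name: String, age: Int, height: Int)

val rdd  = sc.makeRDD(Seq(Person("Alice", 30, 170), Person("Bob", 25, 175)))
val ds   = rdd.toDS()        // RDD[Person]  -> Dataset[Person]
val df   = ds.toDF()         // Dataset      -> DataFrame (Dataset[Row])
val ds2  = df.as[Person]     // DataFrame    -> Dataset[Person]
val rows = df.rdd            // DataFrame    -> RDD[Row]
```

Note that .rdd on a DataFrame yields RDD[Row], while .rdd on a typed Dataset yields an RDD of the element type.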
RDD to DataFrame Example
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val arr = Array(("Alice", 30, 170), ("Bob", 25, 175))
val rdd = sc.makeRDD(arr).map(f => Row(f._1, f._2, f._3))
val schema = StructType(Seq(
StructField("name", StringType, false),
StructField("age", IntegerType, false),
StructField("height", IntegerType, false)
))
val df = spark.createDataFrame(rdd, schema)
df.show()
Side-by-Side Comparison
| Dimension | RDD | DataFrame | Dataset |
|---|---|---|---|
| Type safety | Runtime | Runtime | Compile-time |
| Query optimization | None | Catalyst | Catalyst |
| Serialization efficiency | Java/Kryo | Tungsten | Tungsten |
| Use cases | Unstructured data, graph computing | Structured queries, ETL | Type-sensitive structured processing |
- RDD: Suitable for unstructured data and iterative computing, but lacks optimization intelligence
- DataFrame: Best performance for structured queries, but sacrifices type safety
- Dataset: Compile-time type checking + runtime query optimization, best choice for type-aware development