This is the 80th article in the Big Data series. It gives a comprehensive introduction to Spark SQL's design philosophy, core components, and practical usage.
From Hive to SparkSQL Evolution
Early Hive translated SQL into MapReduce jobs: every query paid the cold-start cost of an MR job, so latency was high and iteration was slow. Shark (from the Spark 0.x era) sped up queries by replacing Hive's execution engine with Spark while keeping HiveQL compatibility.
In 2014, the Spark project abandoned Shark, and the work split into two directions:
- Hive on Spark: Hive replaces its execution engine with Spark, and the project remains Hive-led
- Spark SQL: a Spark-led structured data processing module with a brand-new API system
Spark SQL became the mainstream choice. Its 10-100x performance improvement over Hive comes from memory optimization (4-5x better memory efficiency), the Catalyst query optimizer, and code generation.
Core Data Abstractions
DataFrame
DataFrame is a distributed collection of Row objects with schema information, conceptually equivalent to RDD[Row] + Schema. The schema records each column's name and type, which lets the engine apply optimizations such as column pruning and predicate pushdown.
Supported data types:
- Primitive types: String, Boolean, Integer, Long, Double, Decimal, Date, Timestamp
- Complex types: Array, Map, Struct (supports nesting)
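For complex types, the schema can also be declared explicitly with StructType instead of relying on inference. A minimal sketch (the field names and sample data are invented for illustration):

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types._

object SchemaExample {
  // A schema mixing primitive and complex types:
  // tags is an Array, scores a Map, and address a nested Struct.
  val schema: StructType = StructType(Seq(
    StructField("name", StringType, nullable = false),
    StructField("age", LongType, nullable = true),
    StructField("tags", ArrayType(StringType), nullable = true),
    StructField("scores", MapType(StringType, DoubleType), nullable = true),
    StructField("address", StructType(Seq(
      StructField("city", StringType),
      StructField("zip", StringType)
    )), nullable = true)
  ))

  // Build a one-row DataFrame that conforms to the schema above.
  def build(spark: SparkSession): DataFrame = {
    val rows = Seq(
      Row("Alice", 30L, Seq("admin"), Map("math" -> 95.0), Row("Beijing", "100000"))
    )
    spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // local mode for illustration only
      .appName("SchemaExample")
      .getOrCreate()
    build(spark).printSchema()
    spark.stop()
  }
}
```

An explicit schema avoids an extra inference pass over the data and guards against columns silently changing type between runs.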
import spark.implicits._  // required for the $"col" column syntax
val df = spark.read.json("hdfs://data/users.json")
df.printSchema()
// root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)
df.select("name", "age")
  .filter($"age" > 18)
  .groupBy("age")
  .count()
  .show()
Dataset
Introduced in Spark 1.6, Dataset adds compile-time type checking on top of DataFrame, catching type errors before runtime. Spark 2.0 unified the two: DataFrame is simply a type alias for Dataset[Row].
case class User(name: String, age: Long)

import spark.implicits._  // provides the Encoder for case classes
val ds: Dataset[User] = df.as[User]
ds.filter(_.age > 18).map(_.name).show()
Catalyst Optimizer
Catalyst is Spark SQL's query optimization engine. Its processing flow has four stages:
SQL/DataFrame API
│
▼
① Parsing → Unresolved Logical Plan
│
▼
② Analysis → Resolved Logical Plan (bind column names, type inference)
│
▼
③ Optimization → Optimized Logical Plan
│ · Predicate Pushdown
│ · Column Pruning
│ · Constant Folding
│ · Partition Pruning
▼
④ Physical Plan Generation → Select optimal execution strategy (sort merge join / broadcast join, etc.)
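Each stage can be inspected directly: `explain(extended = true)` (or `EXPLAIN EXTENDED` in SQL) prints the parsed, analyzed, and optimized logical plans along with the physical plan. A minimal sketch with invented sample data:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ExplainExample {
  // A query whose optimized plan demonstrates predicate pushdown
  // (the age filter) and column pruning (only name is kept).
  def query(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(("Alice", 30L), ("Bob", 15L)).toDF("name", "age")
      .filter($"age" > 18)
      .select("name")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // local mode for illustration only
      .appName("ExplainExample")
      .getOrCreate()
    // Prints the Parsed, Analyzed and Optimized Logical Plans
    // plus the selected Physical Plan.
    query(spark).explain(extended = true)
    spark.stop()
  }
}
```

Comparing the analyzed plan with the optimized plan is the easiest way to see Catalyst's rewrites applied to your own queries.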
SQL Query Examples
Spark SQL supports standard ANSI SQL, including advanced features such as window functions and subqueries:
spark.sql("""
SELECT
department,
name,
salary,
RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rank
FROM employees
WHERE salary > 50000
""").show()
The DataFrame API and SQL can also be mixed freely:
val employeesDF = spark.read.parquet("hdfs://data/employees")
employeesDF.createOrReplaceTempView("employees")
// Then can query this temporary view via spark.sql()
spark.sql("SELECT department, AVG(salary) FROM employees GROUP BY department").show()
Multi-Data Source Integration
Spark SQL natively reads and writes many data sources through a single unified API:
// Parquet (recommended format, columnar storage)
val df1 = spark.read.parquet("hdfs://data/events")
// JSON
val df2 = spark.read.json("hdfs://data/users.json")
// CSV
val df3 = spark.read.option("header", "true").csv("hdfs://data/orders.csv")
// JDBC (MySQL, PostgreSQL, etc.)
val df4 = spark.read.format("jdbc")
.option("url", "jdbc:mysql://localhost:3306/mydb")
.option("dbtable", "orders")
.option("user", "root")
.option("password", "password")
.load()
// Write to Parquet, partitioned by date
df4.write.partitionBy("date").parquet("hdfs://output/orders")
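The partitioned layout pays off on read: a filter on the partition column lets Spark skip whole directories, which is the partition pruning listed among Catalyst's optimizations. A sketch with invented sample data, writing to a temporary local directory:

```scala
import org.apache.spark.sql.SparkSession

object PartitionPruningExample {
  def run(spark: SparkSession): Long = {
    import spark.implicits._
    val path = java.nio.file.Files.createTempDirectory("orders").toString

    // Each distinct date value becomes its own subdirectory:
    // date=2024-01-01/, date=2024-01-02/, ...
    Seq((1, "2024-01-01"), (2, "2024-01-02"), (3, "2024-01-02"))
      .toDF("id", "date")
      .write.mode("overwrite").partitionBy("date").parquet(path)

    // Filtering on the partition column scans only the matching directory.
    spark.read.parquet(path).filter($"date" === "2024-01-02").count()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // local mode for illustration only
      .appName("PartitionPruningExample")
      .getOrCreate()
    println(run(spark)) // 2
    spark.stop()
  }
}
```

The pruning is visible in `explain()` output as a PartitionFilters entry on the file scan node.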
SparkSQL vs Traditional Hive Comparison
| Dimension | Hive | SparkSQL |
|---|---|---|
| Execution Engine | MapReduce | Spark (in-memory computing) |
| Query latency | Minute-level | Second-level (millisecond-level for small queries) |
| Iterative computation | Inefficient | Efficient (cache intermediate results in memory) |
| API | HiveQL | SQL + DataFrame + Dataset |
| Applicable scenario | Offline large batch ETL | Interactive queries, real-time stream processing |
Spark SQL has become the preferred solution for structured data processing in the big data stack. Combined with Spark Streaming or Structured Streaming, it can also unify batch and stream processing logic.
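The "cache intermediate results in memory" advantage from the comparison table maps to an explicit API: `cache()` (or `persist()`) keeps a DataFrame in memory so that repeated queries in iterative or interactive workloads reuse it instead of recomputing from the source. A minimal sketch with invented data:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object CacheExample {
  def run(spark: SparkSession): (Long, Long) = {
    import spark.implicits._
    // Mark the DataFrame for in-memory caching;
    // it is materialized on the first action that touches it.
    val df = (1 to 1000).toDF("n").cache()

    // Both queries below reuse the cached rows rather than rebuilding them.
    val evens = df.filter($"n" % 2 === 0).count()
    val total = df.agg(sum("n")).first().getLong(0)
    (evens, total)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // local mode for illustration only
      .appName("CacheExample")
      .getOrCreate()
    println(run(spark)) // (500,500500)
    spark.stop()
  }
}
```

With MapReduce-backed Hive, each of those two queries would re-read the source; this reuse is exactly why the table marks Spark SQL as efficient for iterative computation.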