This is article 80 in the Big Data series. It gives a comprehensive introduction to Spark SQL's design philosophy, core components, and practical usage.

From Hive to SparkSQL Evolution

Early Hive translated SQL into MapReduce jobs for execution; every query paid the cold-start cost of an MR job, so latency was high and iteration slow. Shark (from the Spark 0.x era) sped up queries by swapping in Spark as Hive's execution engine while keeping HiveQL compatibility.

In 2014, the Spark project abandoned Shark and the work split into two directions:

  • Hive on Spark: Hive keeps control but replaces its execution engine with Spark
  • Spark SQL: a Spark-led structured data processing module with a brand-new API stack

Spark SQL became the mainstream choice. Its reported 10-100x performance improvement over Hive comes from in-memory optimization (roughly 4-5x better memory efficiency), the Catalyst query optimizer, and code generation.

Core Data Abstractions

DataFrame

A DataFrame is a distributed collection of Rows with schema information, conceptually equivalent to RDD[Row] + Schema. The schema describes each column's name and type, enabling engine optimizations such as column pruning and predicate pushdown.
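The "RDD[Row] + Schema" idea can be sketched directly: build an RDD of Rows, attach an explicit StructType, and get a DataFrame. This is a minimal illustration; the session settings and field values are assumptions, not from the original article.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

// Illustrative local session (master/appName are assumptions)
val spark = SparkSession.builder()
  .appName("schema-demo")
  .master("local[*]")
  .getOrCreate()

// An RDD[Row] with no schema attached yet
val rows = spark.sparkContext.parallelize(Seq(
  Row("Alice", 30L),
  Row("Bob", 17L)
))

// The schema supplies the column names and types
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", LongType, nullable = true)
))

// DataFrame = RDD[Row] + Schema
val df = spark.createDataFrame(rows, schema)
df.printSchema()
```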

Supported data types:

  • Primitive types: String, Boolean, Integer, Long, Double, Decimal, Date, Timestamp
  • Complex types: Array, Map, Struct (supports nesting)
val df = spark.read.json("hdfs://data/users.json")
df.printSchema()
// root
//  |-- age: long (nullable = true)
//  |-- name: string (nullable = true)

// The $-column syntax requires: import spark.implicits._
df.select("name", "age")
  .filter($"age" > 18)
  .groupBy("age")
  .count()
  .show()

Dataset

Introduced in Spark 1.6, Dataset adds compile-time type checking on top of DataFrame, avoiding runtime type errors. Spark 2.0 unified the two: DataFrame = Dataset[Row].

case class User(name: String, age: Long)

// Converting needs an implicit Encoder: import spark.implicits._
val ds: Dataset[User] = df.as[User]
ds.filter(_.age > 18).map(_.name).show()
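The compile-time benefit is easiest to see side by side: a typo in a Dataset field fails at compile time, while the same typo in a DataFrame column name only fails at runtime. The session setup and sample data below are illustrative assumptions.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

case class User(name: String, age: Long)

// A typed Dataset built from local data (sample values are assumptions)
val ds: Dataset[User] = Seq(User("Alice", 30L), User("Bob", 17L)).toDS()

// Compiles: `age` is a checked field of User
val adults = ds.filter(_.age > 18)

// Would NOT compile -- the typo is caught before the job ever runs:
// ds.filter(_.agee > 18)

// The equivalent DataFrame typo only fails at runtime with an AnalysisException:
// ds.toDF().filter($"agee" > 18)
```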

Catalyst Optimizer

Catalyst is Spark SQL's query optimization engine. Its processing pipeline has four steps:

SQL / DataFrame API
  │
  ① Parsing → Unresolved Logical Plan
  │
  ② Analysis → Resolved Logical Plan (bind column names, infer types)
  │
  ③ Optimization → Optimized Logical Plan
  │     · Predicate Pushdown
  │     · Column Pruning
  │     · Constant Folding
  │     · Partition Pruning
  │
  ④ Physical Plan Generation → select the optimal execution strategy (sort-merge join / broadcast join, etc.)
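These four stages can be inspected for any query: `explain(true)` prints the parsed, analyzed, and optimized logical plans plus the final physical plan. The sample data below is an assumption for demonstration.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Illustrative in-memory data (values are assumptions)
val df = Seq(("Alice", 30L), ("Bob", 17L)).toDF("name", "age")

// extended = true prints all of Catalyst's plans, so effects like
// predicate pushdown and column pruning become visible
df.filter($"age" > 18).select("name").explain(true)
```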

SQL Query Examples

Spark SQL supports standard SQL (a large subset of ANSI SQL), including window functions, subqueries, and other advanced features:

spark.sql("""
  SELECT
    department,
    name,
    salary,
    RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rank
  FROM employees
  WHERE salary > 50000
""").show()

You can also mix the DataFrame API and SQL:

val employeesDF = spark.read.parquet("hdfs://data/employees")
employeesDF.createOrReplaceTempView("employees")

// The temporary view can now be queried via spark.sql()
spark.sql("SELECT department, AVG(salary) FROM employees GROUP BY department").show()

Multi-Data Source Integration

Spark SQL natively supports reading and writing many data sources through a unified API:

// Parquet (recommended format, columnar storage)
val df1 = spark.read.parquet("hdfs://data/events")

// JSON
val df2 = spark.read.json("hdfs://data/users.json")

// CSV
val df3 = spark.read.option("header", "true").csv("hdfs://data/orders.csv")

// JDBC (MySQL, PostgreSQL, etc.)
val df4 = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/mydb")
  .option("dbtable", "orders")
  .option("user", "root")
  .option("password", "password")
  .load()

// Write to Parquet, partitioned by date
df4.write.partitionBy("date").parquet("hdfs://output/orders")

SparkSQL vs Traditional Hive Comparison

| Dimension             | Hive                    | Spark SQL                                        |
|-----------------------|-------------------------|--------------------------------------------------|
| Execution engine      | MapReduce               | Spark (in-memory computing)                      |
| Query latency         | Minutes                 | Seconds (milliseconds for small queries)         |
| Iterative computation | Inefficient             | Efficient (intermediate results cached in memory)|
| API                   | HiveQL                  | SQL + DataFrame + Dataset                        |
| Typical scenario      | Offline large-batch ETL | Interactive queries, real-time stream processing |

Spark SQL has become the preferred solution for structured data processing in the big data stack. Combined with Spark Streaming or Structured Streaming, it can also unify batch and stream processing logic.
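That batch/stream unification can be sketched with Structured Streaming: the same DataFrame operations (groupBy, count) apply to an unbounded stream. The socket source, host, and port below are assumptions chosen purely for demonstration.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// An unbounded DataFrame fed by a text socket (host/port are assumptions)
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Exactly the same groupBy/count logic as a batch DataFrame
val wordCounts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

// The query runs continuously until stopped, printing updated counts
wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
  .awaitTermination()
```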