This is article 73 in the Big Data series. Through big data's "Hello World" program, WordCount, we master the core processing model of Spark RDDs and provide complete Scala and Java implementations.
Why Start with WordCount
WordCount is the classic entry-level example for distributed computing. It is not only simple, but also fully embodies Spark's core "divide and conquer" approach:
- Data loading: Read raw text from file system
- Split: Split lines into words (flatMap)
- Map: Map each word to a (word, 1) key-value pair
- Aggregate: Aggregate counts by key (reduceByKey)
- Output: Write to file or print to console
These five steps correspond one-to-one with Hadoop MapReduce's Map/Shuffle/Reduce phases, but with significantly less code.
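The same five-step pipeline can be sketched on an ordinary in-memory collection, without Spark, to see the data flow (a local illustration only; the RDD versions below run these same operations in parallel across a cluster):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class LocalWordCount {
    public static void main(String[] args) {
        // 1. "Load": a few lines standing in for a text file
        List<String> lines = Arrays.asList("hello world", "hello spark");
        // 2-4. Split, map, aggregate: the same pipeline reduceByKey performs per key
        Map<String, Long> counts = lines.stream()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
        // 5. Output
        counts.forEach((word, n) -> System.out.println(word + " " + n));
    }
}
```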
Maven Project Configuration
```xml
<properties>
    <scala.version>2.12.10</scala.version>
    <spark.version>2.4.5</spark.version>
</properties>

<dependencies>
    <!-- Spark Core -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.12</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <!-- Scala standard library -->
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>${scala.version}</version>
    </dependency>
</dependencies>
```
Scala Implementation
```scala
package icu.wzk

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // 1. Create SparkConf and SparkContext
    val conf = new SparkConf().setAppName("ScalaWordCount")
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")
    // 2. Read the input file, generating an RDD[String] (each element is one line)
    val lines: RDD[String] = sc.textFile(args(0))
    // 3. Split by whitespace, generating an RDD[String] (each element is one word)
    val words: RDD[String] = lines.flatMap(line => line.split("\\s+"))
    // 4. Map each word to a (word, 1) key-value pair
    val wordMap: RDD[(String, Int)] = words.map(x => (x, 1))
    // 5. Aggregate by key
    val result: RDD[(String, Int)] = wordMap.reduceByKey(_ + _)
    // 6. Trigger an Action to output the result
    result.foreach(println)
    sc.stop()
  }
}
```
The Scala version is concise and intuitive; reduceByKey(_ + _) uses functional syntax to eliminate a lot of boilerplate.
Java Implementation
```java
package icu.wzk;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Arrays;

public class JavaWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("JavaWordCount")
                // Hard-coded for local testing; omit when submitting to a cluster
                .setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        sc.setLogLevel("WARN");
        // Read text file
        JavaRDD<String> lines = sc.textFile(args[0]);
        // Split words
        JavaRDD<String> words = lines
                .flatMap(line -> Arrays.stream(line.split("\\s+")).iterator());
        // Map to (word, 1)
        JavaPairRDD<String, Integer> wordsMap = words
                .mapToPair(word -> new Tuple2<>(word, 1));
        // Aggregate by key
        JavaPairRDD<String, Integer> results =
                wordsMap.reduceByKey((x, y) -> x + y);
        results.foreach(elem -> System.out.println(elem));
        sc.stop();
    }
}
```
The Java version uses the Java-friendly APIs JavaSparkContext, JavaRDD, and JavaPairRDD; the logic is the same, but the code is more verbose.
Compile and Submit
Package
```shell
mvn clean package -DskipTests
```
This generates target/spark-wordcount-1.0-SNAPSHOT.jar.
Run in Local Mode (Testing)
```shell
spark-submit \
  --master local[*] \
  --class icu.wzk.WordCount \
  target/spark-wordcount-1.0-SNAPSHOT.jar \
  /opt/wzk/input.txt
```
Submit to YARN Cluster
```shell
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 2g \
  --num-executors 3 \
  --class icu.wzk.WordCount \
  target/spark-wordcount-1.0-SNAPSHOT.jar \
  hdfs://namenode:9000/input/data.txt
```
Key Concepts Review
| Concept | Description |
|---|---|
| RDD | Resilient Distributed Dataset, Spark's core abstraction; supports fault tolerance and parallel computation |
| Lazy evaluation | Transformations only record an operation description; an Action triggers actual execution |
| Shuffle | Operators like reduceByKey trigger cross-node data redistribution, a common performance bottleneck |
| DAG | Spark represents a job as a Directed Acyclic Graph; stages are divided at shuffle boundaries |
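Lazy evaluation can be felt locally through an analogy with Java streams, which also record operations without running them until a terminal operation fires (this is an analogy only, not Spark itself):

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.IntStream;
import java.util.stream.Stream;

public class LazyDemo {
    public static void main(String[] args) {
        AtomicInteger evaluated = new AtomicInteger();
        // Like an RDD transformation, Stream.map only records the operation
        Stream<Integer> doubled = IntStream.rangeClosed(1, 5).boxed()
                .map(x -> { evaluated.incrementAndGet(); return x * 2; });
        System.out.println("after map: " + evaluated.get());   // 0: nothing ran yet
        // The terminal operation plays the role of an Action and forces the work
        int total = doubled.mapToInt(Integer::intValue).sum();
        System.out.println("after sum: " + evaluated.get() + ", total " + total);
    }
}
```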
Practical Application Scenarios
The WordCount pattern can be migrated directly to:
- Log analysis: counting ERROR/WARN frequencies
- Keyword statistics: hot-word extraction from e-commerce reviews
- Text feature extraction: the term-frequency step of TF-IDF
- Recommendation systems: item frequency statistics over user behavior sequences
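Taking the first scenario as an example, counting log levels is WordCount with a filter in front. A minimal local sketch (the log lines here are hypothetical; in Spark they would come from sc.textFile and the aggregation would be reduceByKey):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class LogLevelCount {
    public static void main(String[] args) {
        // Hypothetical log lines of the form "date LEVEL message"
        List<String> logs = Arrays.asList(
                "2024-01-01 ERROR disk full",
                "2024-01-01 WARN slow response",
                "2024-01-02 ERROR disk full",
                "2024-01-02 INFO started up");
        // Same WordCount pattern, filtered down to ERROR/WARN levels
        Map<String, Long> levelCounts = logs.stream()
                .map(line -> line.split("\\s+")[1])  // extract the level field
                .filter(level -> level.equals("ERROR") || level.equals("WARN"))
                .collect(Collectors.groupingBy(l -> l, Collectors.counting()));
        levelCounts.forEach((level, n) -> System.out.println(level + ": " + n));
    }
}
```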