This is article 73 in the Big Data series. Using big data's "Hello World", WordCount, we work through Spark's core RDD processing model and provide complete Scala and Java implementations.

Why Start with WordCount

WordCount is the classic entry-level example for distributed computing. It is not only simple, but also fully embodies Spark's "divide and conquer" way of thinking:

  1. Data loading: Read raw text from file system
  2. Split: Split lines into words (flatMap)
  3. Map: Map each word to (word, 1) key-value pair
  4. Aggregate: Aggregate counts by key (reduceByKey)
  5. Output: Write to file or print to console

These five steps correspond one-to-one to the Map/Shuffle/Reduce phases of Hadoop MapReduce, but with far less code. The sketch below chains them into a single pipeline.
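
A minimal sketch of the five steps as one chained expression (assuming sc is an initialized SparkContext; both paths are placeholders):

// 1. load -> 2. split -> 3. map -> 4. aggregate -> 5. output
sc.textFile("input.txt")
  .flatMap(line => line.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .saveAsTextFile("output")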

Maven Project Configuration

<properties>
    <scala.version>2.12.10</scala.version>
    <spark.version>2.4.5</spark.version>
</properties>

<dependencies>
    <!-- Spark Core -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.12</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <!-- Scala standard library -->
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>${scala.version}</version>
    </dependency>
</dependencies>
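
Note that the dependencies alone will not compile Scala sources under Maven; a Scala compiler plugin is also needed in the <build> section. A typical sketch using scala-maven-plugin (the version shown is an assumption; pick one compatible with your Scala version):

<build>
    <plugins>
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>4.4.0</version>
            <executions>
                <execution>
                    <goals>
                        <!-- attach Scala compilation to Maven's compile phases -->
                        <goal>compile</goal>
                        <goal>testCompile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>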

Scala Implementation

package icu.wzk

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // 1. Create SparkConf and SparkContext.
    //    No setMaster here: the master is supplied via spark-submit's --master flag.
    val conf = new SparkConf().setAppName("ScalaWordCount")
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")

    // 2. Read input file, generate RDD[String] (each element is one line)
    val lines: RDD[String] = sc.textFile(args(0))

    // 3. Split by whitespace, generate RDD[String] (each element is one word)
    val words: RDD[String] = lines.flatMap(line => line.split("\\s+"))

    // 4. Map to (word, 1) key-value pair
    val wordMap: RDD[(String, Int)] = words.map(x => (x, 1))

    // 5. Aggregate by key
    val result: RDD[(String, Int)] = wordMap.reduceByKey(_ + _)

    // 6. Trigger the action to execute the job. Note: foreach runs on the
    //    executors, so in cluster mode the output appears in executor logs
    //    rather than on the driver console.
    result.foreach(println)

    sc.stop()
  }
}

The Scala version is concise and intuitive; reduceByKey(_ + _) uses placeholder syntax for the binary function and eliminates a lot of boilerplate.
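
A common follow-up is sorting by count before printing. A minimal sketch that could replace step 6 above (take(10) is an action that brings only the top entries back to the driver):

// Sort by count in descending order, then fetch the top 10 to the driver
val sorted: RDD[(String, Int)] = result.sortBy(_._2, ascending = false)
sorted.take(10).foreach(println)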

Java Implementation

package icu.wzk;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;
import java.util.Arrays;

public class JavaWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("JavaWordCount")
                .setMaster("local[*]"); // hardcoded for local testing only; remove when submitting to a cluster
        JavaSparkContext sc = new JavaSparkContext(conf);
        sc.setLogLevel("WARN");

        // Read text file
        JavaRDD<String> lines = sc.textFile(args[0]);

        // Split words
        JavaRDD<String> words = lines
                .flatMap(line -> Arrays.stream(line.split("\\s+")).iterator());

        // Map to (word, 1)
        JavaPairRDD<String, Integer> wordsMap = words
                .mapToPair(word -> new Tuple2<>(word, 1));

        // Aggregate by key
        JavaPairRDD<String, Integer> results =
                wordsMap.reduceByKey((x, y) -> x + y);

        results.foreach(elem -> System.out.println(elem));
        sc.stop();
    }
}

The Java version uses the Java-friendly wrappers JavaSparkContext, JavaRDD, and JavaPairRDD. The logic is identical, but the code is noticeably more verbose.

Compile and Submit

Package

mvn clean package -DskipTests

This produces target/spark-wordcount-1.0-SNAPSHOT.jar.

Run in Local Mode (Testing)

spark-submit \
  --master local[*] \
  --class icu.wzk.WordCount \
  target/spark-wordcount-1.0-SNAPSHOT.jar \
  /opt/wzk/input.txt
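
For example, if /opt/wzk/input.txt contained the line "hello world hello spark", the console output would look like this (ordering across partitions is not guaranteed):

(hello,2)
(world,1)
(spark,1)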

Submit to YARN Cluster

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 2g \
  --num-executors 3 \
  --class icu.wzk.WordCount \
  target/spark-wordcount-1.0-SNAPSHOT.jar \
  hdfs://namenode:9000/input/data.txt
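
In cluster mode the driver itself runs on YARN, so result.foreach(println) writes to executor stdout rather than to your terminal. Either replace the foreach with saveAsTextFile, or fetch the logs afterwards with the yarn logs command (the application ID below is a placeholder):

yarn logs -applicationId application_1234567890_0001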

Key Concepts Review

  • RDD: Resilient Distributed Dataset, Spark's core abstraction; supports fault tolerance and parallel computation
  • Lazy evaluation: transformations only record an operation description; an action triggers actual execution
  • Shuffle: operators like reduceByKey trigger cross-node data redistribution and are a common performance bottleneck
  • DAG: Spark represents a job as a Directed Acyclic Graph, with stages divided at shuffle boundaries
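
Lazy evaluation in particular is easy to observe. A minimal sketch (assuming a live SparkContext sc):

// Transformations only record the lineage; nothing executes yet
val lines = sc.textFile("input.txt")
val words = lines.flatMap(line => line.split("\\s+"))

// count() is an action: only now does Spark build the DAG and run the job
println(words.count())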

Practical Application Scenarios

The WordCount pattern can be migrated directly to scenarios such as the following; a sketch of the log-analysis case appears after the list:

  • Log analysis: counting ERROR/WARN frequencies
  • Keyword statistics: hot-word extraction from e-commerce reviews
  • Text feature extraction: the term-frequency step of TF-IDF
  • Recommendation systems: item frequency statistics over user behavior sequences
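
As an illustration, the log-analysis case is just WordCount with a filter step added. A sketch, assuming whitespace-separated log lines in which the level appears as a token (the HDFS path is a placeholder):

// Count how often ERROR and WARN appear in a log file
val levelCounts = sc.textFile("hdfs://namenode:9000/logs/app.log")
  .flatMap(line => line.split("\\s+"))
  .filter(token => token == "ERROR" || token == "WARN")
  .map(level => (level, 1))
  .reduceByKey(_ + _)
levelCounts.foreach(println)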