This is article 74 in the Big Data series. Through two classic Spark cases, Monte Carlo Pi estimation and mutual friends analysis, we build further familiarity with RDD distributed computing patterns and operator combination techniques.

Case 1: Monte Carlo Method Pi Estimation

Mathematical Principle

Drop points uniformly at random inside a square; the ratio of points that land inside the inscribed circle to the total number of points approaches π/4:

Circle area / Square area = π·r² / (2r)² = π/4

Therefore π ≈ 4 × (points inside circle / total points); the more samples, the more accurate the estimate. The code below uses the equivalent quarter-circle form: each point (x, y) is drawn from [0, 1)², and it counts as a hit when x² + y² ≤ 1.
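The convergence can be checked locally without Spark. A minimal pure-Scala sketch (the object and method names, sample count, and seed here are illustrative choices, not from the article):

```scala
import scala.util.Random

object PiSketch {
  // Estimate Pi by sampling n points in the unit square and counting
  // how many land inside the quarter circle x^2 + y^2 <= 1.
  def estimatePi(n: Int, seed: Long): Double = {
    val rng = new Random(seed)
    var hits = 0
    var i = 0
    while (i < n) {
      val x = rng.nextDouble()
      val y = rng.nextDouble()
      if (x * x + y * y <= 1.0) hits += 1
      i += 1
    }
    4.0 * hits / n
  }

  def main(args: Array[String]): Unit = {
    println(estimatePi(1000000, seed = 42L)) // close to 3.1416 for large n
  }
}
```

With a million samples the standard error of the estimate is on the order of 10⁻³, which is why the Spark job below uses 100 million samples.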

Spark Implementation

package icu.wzk

import org.apache.spark.{SparkConf, SparkContext}
import scala.math.random

object SparkPi {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("ScalaSparkPi")
      .setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")

    // Pass partition count via args(0), default 15
    val slices = if (args.length > 0) args(0).toInt else 15
    val N = 100000000  // 100 million samples

    val count = sc.makeRDD(1 to N, slices)
      .map { _ =>
        val (x, y) = (random, random)
        if (x * x + y * y <= 1) 1 else 0
      }
      .reduce(_ + _)

    println(s"Pi is approximately ${4.0 * count / N}")
    sc.stop()
  }
}

Key Points

  • sc.makeRDD(1 to N, slices): distributes the 100 million integers evenly across slices partitions; each executor processes its share in parallel
  • map: each task independently generates a random point and checks whether it falls inside the circle (no inter-task dependency, so the work is naturally parallel)
  • reduce(_ + _): aggregates the hit counts from all partitions
  • More partitions mean higher parallelism, but scheduling overhead also grows; a common rule of thumb is 2-3x the number of CPU cores
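The map-then-reduce shape of the job can be mimicked on local collections to see why partitioning does not change the answer. In this sketch (names like PartitionedPi and hitsInSlice are hypothetical) each sample is seeded by its index, so counting per slice and summing gives exactly the single-pass count:

```scala
import scala.util.Random

object PartitionedPi {
  // Count quarter-circle hits for one slice of sample indices. Seeding the
  // generator from the sample index makes every sample reproducible, so the
  // total is independent of how the indices are split into slices.
  def hitsInSlice(indices: Seq[Int]): Int =
    indices.count { i =>
      val rng = new Random(i)
      val (x, y) = (rng.nextDouble(), rng.nextDouble())
      x * x + y * y <= 1.0
    }

  def main(args: Array[String]): Unit = {
    val n = 10000
    val slices = 4
    // Map each slice to a local count, then sum -- the same shape as
    // makeRDD(...).map(...).reduce(_ + _) across Spark partitions.
    val total = (1 to n).grouped(n / slices).map(hitsInSlice).sum
    println(total == hitsInSlice(1 to n)) // partitioned and single-pass counts agree
  }
}
```

This is the property the real job relies on: hit counting is associative and commutative, so reduce can combine partial results in any order.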

Running

# Local mode, 15 partitions
spark-submit --master local[*] \
  --class icu.wzk.SparkPi \
  spark-wordcount-1.0-SNAPSHOT.jar 15

Output example: Pi is approximately 3.14159268

Case 2: Mutual Friends Analysis

Problem Description

Given social network friend relationship data (each line has the format: userID friend1,friend2,friend3,...), find the mutual friends list for every pair of users.

Sample data:

100 200,300,400,500
200 100,300,600
300 100,200,400

Method 1: Cartesian Product (Intuitive Solution)

Pair up all users and take the intersection of their friend lists:

val friendsRDD = sc.textFile(input).map { line =>
  val parts = line.split("\\s+")
  (parts(0).toInt, parts(1).split(",").map(_.toInt).toSet)
}

friendsRDD.cartesian(friendsRDD)
  .filter { case ((id1, _), (id2, _)) => id1 < id2 }  // Keep each pair once
  .map { case ((id1, f1), (id2, f2)) =>
    (s"$id1 & $id2", f1.intersect(f2))
  }
  .collect()

Disadvantage: cartesian materializes O(n²) intermediate records, so performance degrades sharply as the number of users grows; it is not suitable for production.
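The quadratic blow-up is easy to demonstrate on a local collection. This small sketch (CartesianCost and pairCounts are illustrative names) builds the full cross product the way cartesian does, then applies the id1 < id2 filter:

```scala
object CartesianCost {
  // Local analogue of friendsRDD.cartesian(friendsRDD): every ordered pair
  // is materialized before the id1 < id2 filter throws most of them away.
  def pairCounts(n: Int): (Int, Int) = {
    val users = (1 to n).toList
    val cross = for (a <- users; b <- users) yield (a, b) // n^2 pairs
    val kept  = cross.filter { case (a, b) => a < b }     // n(n-1)/2 pairs
    (cross.size, kept.size)
  }

  def main(args: Array[String]): Unit =
    println(pairCounts(100)) // (10000,4950)
}
```

More than half the generated pairs are discarded immediately, and the surviving n(n-1)/2 pairs still grow quadratically in the number of users.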

Method 2: Inverted Pair Index (Recommended)

Core idea: reverse the perspective. Instead of asking “what is the intersection of two users’ friend lists”, ask “for each pair that appears together in someone’s friend list, which users know both members of that pair”.

val friendsRDD = sc.textFile(input).map { line =>
  val parts = line.split("\\s+")
  val user = parts(0)
  val friends = parts(1).split(",")
  (user, friends)
}

friendsRDD
  // Generate all pairwise combinations of each user's friend list;
  // sorting first ensures a given pair always produces the same key
  .flatMapValues(friends => friends.sorted.combinations(2).toSeq)
  // Reverse: (user, friend pair) -> (friend pair, user)
  .map { case (user, pair) => (pair.mkString(" & "), Set(user)) }
  // Merge by friend pair, get set of users who mutually know that pair
  .reduceByKey(_ | _)
  .collect()
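The same pipeline can be traced on the three sample lines with plain Scala collections. This is a local sketch (MutualFriendsLocal and mutualFriends are illustrative names); groupBy stands in for the shuffle that reduceByKey performs across the cluster:

```scala
object MutualFriendsLocal {
  def mutualFriends(lines: Seq[String]): Map[String, Set[String]] =
    lines
      .map { line =>
        val parts = line.split("\\s+")
        (parts(0), parts(1).split(","))
      }
      // Emit (friend pair -> user) for every pair in each user's list;
      // sorting makes a pair produce the same key regardless of list order.
      .flatMap { case (user, friends) =>
        friends.sorted.combinations(2).map(pair => (pair.mkString(" & "), user))
      }
      // Local stand-in for reduceByKey(_ | _): group by pair, collect the users.
      .groupBy(_._1)
      .map { case (pair, entries) => (pair, entries.map(_._2).toSet) }

  def main(args: Array[String]): Unit = {
    val sample = Seq("100 200,300,400,500", "200 100,300,600", "300 100,200,400")
    mutualFriends(sample).toSeq.sortBy(_._1).foreach(println)
  }
}
```

On these three lines, for example, the key "200 & 400" collects Set(100, 300), because both user 100 and user 300 list 200 and 400 as friends.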

Advantages:

  • Avoids the Cartesian product: work drops from O(n²) user pairs to O(n·k²) emitted pair records (k = average friend-list length)
  • Uses reduceByKey instead of groupByKey, which pre-aggregates on the map side and reduces shuffle volume
  • The code is cleaner and extends naturally to multi-hop relationship analysis

Result Example

With a complete, symmetric dataset (one that also contains user 400's own line listing 100 and 300), the output looks like:

(100 & 200, Set(300))        // The mutual friend of 100 and 200 is 300
(100 & 300, Set(200, 400))   // The mutual friends of 100 and 300 are 200 and 400
(200 & 300, Set(100))        // The mutual friend of 200 and 300 is 100

What the Two Cases Have in Common

Feature                 | Pi Estimation                           | Mutual Friends
----------------------- | --------------------------------------- | -----------------------------------
Parallelization method  | Independent per-partition computation   | flatMap expansion, then reduceByKey
Key operators           | makeRDD + map + reduce                  | flatMapValues + map + reduceByKey
Performance key         | Partition count (parallelism)           | Avoid cartesian; use reduceByKey
Mathematical essence    | Law of large numbers (random sampling)  | Set operations (intersection/union)

Both cases embody Spark’s core paradigm: break the problem into independent local computations, then aggregate to obtain the global result.