This is article 74 in the Big Data series. Through two classic Spark cases, Monte Carlo Pi estimation and mutual friends analysis, we build further familiarity with RDD distributed computing patterns and operator combination techniques.

Case 1: Monte Carlo Method Pi Estimation

Mathematical Principle

Drop points uniformly at random inside a square; the ratio of points that land inside the inscribed circle to the total number of points approaches π/4:

Circle area / Square area = π·r² / (2r)² = π/4

Therefore π ≈ 4 × (points inside circle / total points); the more samples, the more accurate the estimate. The code below uses the equivalent quarter-circle form: each point (x, y) is drawn from [0, 1)², and it counts as a hit when x² + y² ≤ 1.
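The convergence can be checked locally without Spark. A minimal pure-Scala sketch (the object and method names, sample count, and seed here are illustrative choices, not from the article):

```scala
import scala.util.Random

object PiSketch {
  // Estimate Pi by sampling n points in the unit square and counting
  // how many land inside the quarter circle x^2 + y^2 <= 1.
  def estimatePi(n: Int, seed: Long): Double = {
    val rng = new Random(seed)
    var hits = 0
    var i = 0
    while (i < n) {
      val x = rng.nextDouble()
      val y = rng.nextDouble()
      if (x * x + y * y <= 1.0) hits += 1
      i += 1
    }
    4.0 * hits / n
  }

  def main(args: Array[String]): Unit = {
    println(estimatePi(1000000, seed = 42L)) // close to 3.1416 for large n
  }
}
```

With a million samples the standard error of the estimate is on the order of 10⁻³, which is why the Spark job below uses 100 million samples.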

Spark Implementation

package icu.wzk

import org.apache.spark.{SparkConf, SparkContext}
import scala.math.random

object SparkPi {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("ScalaSparkPi")
      .setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")

    // Pass partition count via args(0), default 15
    val slices = if (args.length > 0) args(0).toInt else 15
    val N = 100000000  // 100 million samples

    val count = sc.makeRDD(1 to N, slices)
      .map { _ =>
        val (x, y) = (random, random)
        if (x * x + y * y <= 1) 1 else 0
      }
      .reduce(_ + _)

    println(s"Pi is approximately ${4.0 * count / N}")
    sc.stop()
  }
}

Key Points

  • sc.makeRDD(1 to N, slices): distributes the 100 million integers evenly across slices partitions; each executor processes its share in parallel
  • map: each task independently generates a random point and checks whether it falls inside the circle (no inter-task dependency, so the work is naturally parallel)
  • reduce(_ + _): aggregates the hit counts from all partitions
  • More partitions mean higher parallelism, but scheduling overhead also grows; a common rule of thumb is 2-3x the number of CPU cores
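The map-then-reduce shape of the job can be mimicked on local collections to see why partitioning does not change the answer. In this sketch (names like PartitionedPi and hitsInSlice are hypothetical) each sample is seeded by its index, so counting per slice and summing gives exactly the single-pass count:

```scala
import scala.util.Random

object PartitionedPi {
  // Count quarter-circle hits for one slice of sample indices. Seeding the
  // generator from the sample index makes every sample reproducible, so the
  // total is independent of how the indices are split into slices.
  def hitsInSlice(indices: Seq[Int]): Int =
    indices.count { i =>
      val rng = new Random(i)
      val (x, y) = (rng.nextDouble(), rng.nextDouble())
      x * x + y * y <= 1.0
    }

  def main(args: Array[String]): Unit = {
    val n = 10000
    val slices = 4
    // Map each slice to a local count, then sum -- the same shape as
    // makeRDD(...).map(...).reduce(_ + _) across Spark partitions.
    val total = (1 to n).grouped(n / slices).map(hitsInSlice).sum
    println(total == hitsInSlice(1 to n)) // partitioned and single-pass counts agree
  }
}
```

This is the property the real job relies on: hit counting is associative and commutative, so reduce can combine partial results in any order.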

Running

# Local mode, 15 partitions
spark-submit --master local[*] \
  --class icu.wzk.SparkPi \
  spark-wordcount-1.0-SNAPSHOT.jar 15

Output example: Pi is approximately 3.14159268

Case 2: Mutual Friends Analysis

Problem Description

Given social network friend relationship data (each line has the format: userID friend1,friend2,friend3,...), find the mutual friends list for every pair of users.

Sample data:

100 200,300,400,500
200 100,300,600
300 100,200,400

Method 1: Cartesian Product (Intuitive Solution)

Pair up all users and take the intersection of their friend lists:

val friendsRDD = sc.textFile(input).map { line =>
  val parts = line.split("\\s+")
  (parts(0).toInt, parts(1).split(",").map(_.toInt).toSet)
}

friendsRDD.cartesian(friendsRDD)
  .filter { case ((id1, _), (id2, _)) => id1 < id2 }  // Keep each pair once
  .map { case ((id1, f1), (id2, f2)) =>
    (s"$id1 & $id2", f1.intersect(f2))
  }
  .collect()

Disadvantage: cartesian materializes O(n²) intermediate records, so performance degrades sharply as the number of users grows; it is not suitable for production.
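The quadratic blow-up is easy to demonstrate on a local collection. This small sketch (CartesianCost and pairCounts are illustrative names) builds the full cross product the way cartesian does, then applies the id1 < id2 filter:

```scala
object CartesianCost {
  // Local analogue of friendsRDD.cartesian(friendsRDD): every ordered pair
  // is materialized before the id1 < id2 filter throws most of them away.
  def pairCounts(n: Int): (Int, Int) = {
    val users = (1 to n).toList
    val cross = for (a <- users; b <- users) yield (a, b) // n^2 pairs
    val kept  = cross.filter { case (a, b) => a < b }     // n(n-1)/2 pairs
    (cross.size, kept.size)
  }

  def main(args: Array[String]): Unit =
    println(pairCounts(100)) // (10000,4950)
}
```

More than half the generated pairs are discarded immediately, and the surviving n(n-1)/2 pairs still grow quadratically in the number of users.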

Method 2: Inverted Pair Index (Recommended)

Core idea: reverse the perspective. Instead of asking “what is the intersection of two users’ friend lists”, ask “for each pair that appears together in someone’s friend list, which users know both members of that pair”.

val friendsRDD = sc.textFile(input).map { line =>
  val parts = line.split("\\s+")
  val user = parts(0)
  val friends = parts(1).split(",")
  (user, friends)
}

friendsRDD
  // Generate all pairwise combinations of each user's friend list;
  // sorting first ensures a given pair always produces the same key
  .flatMapValues(friends => friends.sorted.combinations(2).toSeq)
  // Reverse: (user, friend pair) -> (friend pair, user)
  .map { case (user, pair) => (pair.mkString(" & "), Set(user)) }
  // Merge by friend pair, get set of users who mutually know that pair
  .reduceByKey(_ | _)
  .collect()
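The same pipeline can be traced on the three sample lines with plain Scala collections. This is a local sketch (MutualFriendsLocal and mutualFriends are illustrative names); groupBy stands in for the shuffle that reduceByKey performs across the cluster:

```scala
object MutualFriendsLocal {
  def mutualFriends(lines: Seq[String]): Map[String, Set[String]] =
    lines
      .map { line =>
        val parts = line.split("\\s+")
        (parts(0), parts(1).split(","))
      }
      // Emit (friend pair -> user) for every pair in each user's list;
      // sorting makes a pair produce the same key regardless of list order.
      .flatMap { case (user, friends) =>
        friends.sorted.combinations(2).map(pair => (pair.mkString(" & "), user))
      }
      // Local stand-in for reduceByKey(_ | _): group by pair, collect the users.
      .groupBy(_._1)
      .map { case (pair, entries) => (pair, entries.map(_._2).toSet) }

  def main(args: Array[String]): Unit = {
    val sample = Seq("100 200,300,400,500", "200 100,300,600", "300 100,200,400")
    mutualFriends(sample).toSeq.sortBy(_._1).foreach(println)
  }
}
```

On these three lines, for example, the key "200 & 400" collects Set(100, 300), because both user 100 and user 300 list 200 and 400 as friends.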

Advantages:

  • Avoids the Cartesian product: work drops from O(n²) user pairs to O(n·k²) emitted pair records (k = average friend-list length)
  • Uses reduceByKey instead of groupByKey, which pre-aggregates on the map side and reduces shuffle volume
  • The code is cleaner and extends naturally to multi-hop relationship analysis

Result Example

With a complete, symmetric dataset (one that also contains user 400's own line listing 100 and 300), the output looks like:

(100 & 200, Set(300))        // The mutual friend of 100 and 200 is 300
(100 & 300, Set(200, 400))   // The mutual friends of 100 and 300 are 200 and 400
(200 & 300, Set(100))        // The mutual friend of 200 and 300 is 100

What the Two Cases Have in Common

Feature                 | Pi Estimation                           | Mutual Friends
----------------------- | --------------------------------------- | -----------------------------------
Parallelization method  | Independent per-partition computation   | flatMap expansion, then reduceByKey
Key operators           | makeRDD + map + reduce                  | flatMapValues + map + reduceByKey
Performance key         | Partition count (parallelism)           | Avoid cartesian; use reduceByKey
Mathematical essence    | Law of large numbers (random sampling)  | Set operations (intersection/union)

Both cases embody Spark’s core paradigm: break the problem into independent local computations, then aggregate to obtain the global result.