This is article 74 in the Big Data series. Through two classic Spark cases, Monte Carlo Pi estimation and mutual friends analysis, it builds further familiarity with RDD distributed computing patterns and operator combination techniques.
Case 1: Monte Carlo Method Pi Estimation
Mathematical Principle
Randomly drop points in a square; the fraction of points landing inside the inscribed circle approaches the ratio of the two areas:
Circle area / Square area = π·r² / (2r)² = π/4
Therefore π ≈ 4 × (points inside circle / total points). The more samples, the more accurate the result.
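Before distributing the work, the estimator itself can be sanity-checked with a minimal single-machine sketch in plain Scala (the object name `LocalPi` and the fixed seed are illustrative only):

```scala
import scala.util.Random

object LocalPi {
  def main(args: Array[String]): Unit = {
    val n = 1000000
    val rng = new Random(42) // fixed seed, for reproducibility only
    // Count samples falling inside the quarter circle x^2 + y^2 <= 1
    val hits = (1 to n).count { _ =>
      val (x, y) = (rng.nextDouble(), rng.nextDouble())
      x * x + y * y <= 1
    }
    println(f"pi is approximately ${4.0 * hits / n}%.4f")
  }
}
```

The result fluctuates around 3.14 from run to run when the seed varies, which is exactly the law-of-large-numbers behavior the distributed version exploits at a larger sample size.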
Spark Implementation
package icu.wzk
import org.apache.spark.{SparkConf, SparkContext}
import scala.math.random
object SparkPi {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("ScalaSparkPi")
      .setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")

    // Pass partition count via args(0), default 15
    val slices = if (args.length > 0) args(0).toInt else 15
    val N = 100000000 // 100 million samples

    val count = sc.makeRDD(1 to N, slices)
      .map { _ =>
        val (x, y) = (random, random)
        if (x * x + y * y <= 1) 1 else 0
      }
      .reduce(_ + _)

    println(s"Pi is approximately ${4.0 * count / N}")
    sc.stop()
  }
}
Key Points
- `sc.makeRDD(1 to N, slices)`: distributes the 100 million integers evenly across `slices` partitions; each Executor processes its share in parallel
- `map`: each task independently generates a random point and checks whether it lies inside the circle (no inter-task dependencies, naturally parallel)
- `reduce(_ + _)`: aggregates the hit counts from all partitions
- More partitions mean higher parallelism, but scheduling overhead also grows; typically set to 2-3x the number of CPU cores
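If per-element overhead becomes a concern, the same computation can be written with `mapPartitions` so that each task keeps a local counter and emits a single number per partition. This is a sketch under the same `sc`, `N` and `slices` as above:

```scala
// Sketch: one local counter per partition instead of one Int per sample.
// Assumes the same sc, N and slices as in the code above.
val count = sc.makeRDD(1 to N, slices)
  .mapPartitions { iter =>
    val rng = new scala.util.Random()
    var hits = 0L
    iter.foreach { _ =>
      val (x, y) = (rng.nextDouble(), rng.nextDouble())
      if (x * x + y * y <= 1) hits += 1
    }
    Iterator.single(hits)
  }
  .reduce(_ + _)
println(s"Pi is approximately ${4.0 * count / N}")
```

The behavior is identical; the difference is that each partition produces one `Long` instead of one `Int` per sample before the reduce.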
Running
# Local mode, 15 partitions
spark-submit --master local[*] \
--class icu.wzk.SparkPi \
spark-wordcount-1.0-SNAPSHOT.jar 15
Output example: Pi is approximately 3.14159268
Case 2: Mutual Friends Analysis
Problem Description
Given social network friend relationship data (each line has the format: userID friend1,friend2,friend3,...), find the mutual friends of every pair of users.
Sample data:
100 200,300,400,500
200 100,300,600
300 100,200,400
Method 1: Cartesian Product (Intuitive Solution)
Pair up all users with a Cartesian product, then take the intersection of their friend lists:
val friendsRDD = sc.textFile(input).map { line =>
  val parts = line.split("\\s+")
  (parts(0).toInt, parts(1).split(",").map(_.toInt).toSet)
}

friendsRDD.cartesian(friendsRDD)
  .filter { case ((id1, _), (id2, _)) => id1 < id2 } // keep each unordered pair once
  .map { case ((id1, f1), (id2, f2)) =>
    (s"$id1 & $id2", f1.intersect(f2))
  }
  .collect()
Disadvantage: `cartesian` produces O(n²) pairs, so performance degrades sharply as the number of users grows; not suitable for production.
Method 2: Data Transformation (Recommended)
Core idea: reverse the perspective. Instead of asking "what is the intersection of two users' friend lists", ask "for each pair appearing together in someone's friend list, which users know both of them".
val friendsRDD = sc.textFile(input).map { line =>
  val parts = line.split("\\s+")
  val user = parts(0)
  val friends = parts(1).split(",")
  (user, friends)
}

friendsRDD
  // Generate all pairwise combinations of each user's friend list;
  // sort first so that (A, B) and (B, A) always map to the same key
  .flatMapValues(friends => friends.sorted.combinations(2).toSeq)
  // Reverse: (user, friend pair) -> (friend pair, user)
  .map { case (user, pair) => (pair.mkString(" & "), Set(user)) }
  // Merge by friend pair: the resulting set holds every user who knows both
  .reduceByKey(_ | _)
  .collect()
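The pair-generation logic can be checked without a cluster. This plain-Scala sketch emulates the `flatMapValues` + `reduceByKey` pipeline with ordinary collections on a tiny symmetric toy dataset (the data here is hypothetical, not the sample above):

```scala
// Plain-Scala emulation of the Method 2 pipeline, for sanity-checking
// the pair logic on a tiny symmetric toy dataset.
val data = Seq(
  "100" -> Seq("200", "300"),
  "200" -> Seq("100", "300"),
  "300" -> Seq("100", "200")
)
val mutual = data
  .flatMap { case (user, friends) =>
    friends.sorted.combinations(2).map(pair => (pair.mkString(" & "), user))
  }
  .groupBy { case (pair, _) => pair }
  .map { case (pair, entries) => pair -> entries.map(_._2).toSet }
// mutual("100 & 200") == Set("300"): only 300 lists both 100 and 200
```

`groupBy` plays the role of the shuffle, and merging the grouped values into a `Set` mirrors `reduceByKey(_ | _)`.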
Advantages:
- Avoids the Cartesian product; time complexity drops from O(n²) to O(n·k²), where k is the average number of friends per user
- Uses `reduceByKey` instead of `groupByKey`, pre-aggregating within partitions and reducing Shuffle data volume
- The code is cleaner and easily extends to multi-hop relationship analysis

Note: this method credits user C as a mutual friend of A and B only if C's own line lists both A and B, so it expects symmetric friendship data (if A lists B, then B also lists A).
Result Example
(100 & 200, Set(300))      // the mutual friend of 100 and 200 is 300
(100 & 300, Set(200, 400)) // the mutual friends of 100 and 300 are 200 and 400
(200 & 300, Set(100))      // the mutual friend of 200 and 300 is 100
Common Thought in Both Cases
| Feature | Pi Estimation | Mutual Friends |
|---|---|---|
| Parallelization method | Data partition independent compute | flatMap expand then reduceByKey |
| Key operators | makeRDD + map + reduce | flatMapValues + map + reduceByKey |
| Performance key | Partition count (parallelism) | Avoid cartesian, use reduceByKey |
| Mathematical essence | Law of large numbers (random sampling) | Set operations (intersection/union) |
Both cases embody Spark’s core paradigm: break problem into independent local computations, then aggregate to get global result.