This is article 68 in the Big Data series. It walks step by step through setting up an Apache Spark distributed computing environment, extending the deployment on the existing three-node Hadoop cluster.

Environment Prerequisites

This article assumes the previous articles in the series have been completed:

  • Three Linux nodes (h121, h122, h123), SSH passwordless configured
  • Hadoop 3.x cluster running (HDFS + YARN)
  • JDK 8 installed, JAVA_HOME environment variable configured
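
Before installing, it may be worth a quick check that each node actually meets these prerequisites. A minimal sketch, assuming the hostnames above and passwordless SSH from the node you run it on:

```shell
# Check the JDK, JAVA_HOME, and running Hadoop daemons on every node
for host in h121.wzk.icu h122.wzk.icu h123.wzk.icu; do
  echo "== $host =="
  ssh "$host" 'java -version 2>&1 | head -n 1; echo "JAVA_HOME=$JAVA_HOME"; jps'
done
```

On a healthy cluster, jps should show NameNode/DataNode and ResourceManager/NodeManager processes according to each node's role.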

Download Spark

Download a Spark build matching your Hadoop version from the Apache official archive (choose the without-hadoop build to avoid jar conflicts):

# Download Spark 2.4.5 (without-hadoop version)
wget https://archive.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-without-hadoop-scala-2.12.tgz

# Extract to unified installation directory
tar -zxvf spark-2.4.5-bin-without-hadoop-scala-2.12.tgz -C /opt/servers/
cd /opt/servers/
mv spark-2.4.5-bin-without-hadoop-scala-2.12 spark-2.4.5

Configure Environment Variables

Append the Spark environment variables to /etc/profile (do this on all three nodes):

export SPARK_HOME=/opt/servers/spark-2.4.5
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

Run source /etc/profile to make the configuration take effect.

Configure Core Files

Enter the $SPARK_HOME/conf directory, copy the template files, and edit them:

slaves — Worker Node List

cp slaves.template slaves

Edit slaves and list every Worker node's hostname:

h121.wzk.icu
h122.wzk.icu
h123.wzk.icu

spark-env.sh — Environment Parameters

cp spark-env.sh.template spark-env.sh

Append the following at the end of the file:

export JAVA_HOME=/opt/servers/jdk1.8
export HADOOP_HOME=/opt/servers/hadoop-3.1.3
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_HOME=/opt/servers/spark-2.4.5
export SPARK_MASTER_HOST=h121.wzk.icu
export SPARK_MASTER_PORT=7077
# Required by the without-hadoop build: put the Hadoop jars on Spark's classpath
export SPARK_DIST_CLASSPATH=$(hadoop classpath)

spark-defaults.conf — Default Config

cp spark-defaults.conf.template spark-defaults.conf

Add key parameters:

spark.master                     spark://h121.wzk.icu:7077
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://h121.wzk.icu:8020/spark-logs
spark.driver.memory              512m
spark.serializer                 org.apache.spark.serializer.KryoSerializer

Note: Event log directory must be created on HDFS first: hadoop fs -mkdir -p /spark-logs
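
The directory creation in the note above can also be verified before starting the cluster; the temporary write-test file here is an assumption for illustration:

```shell
# Create the Spark event log directory on HDFS and confirm it is writable
hadoop fs -mkdir -p /spark-logs
hadoop fs -touchz /spark-logs/.write-test && echo "write OK"
hadoop fs -rm /spark-logs/.write-test
```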

Distribute Config to Other Nodes

Use rsync to synchronize the installation directory and environment config to h122 and h123:

rsync -av /opt/servers/spark-2.4.5 root@h122.wzk.icu:/opt/servers/
rsync -av /opt/servers/spark-2.4.5 root@h123.wzk.icu:/opt/servers/

# Sync environment variables (or manually add on each node)
rsync -av /etc/profile root@h122.wzk.icu:/etc/profile
rsync -av /etc/profile root@h123.wzk.icu:/etc/profile
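
Equivalently, the per-node copies can be written as a loop over the worker hosts (same paths as above):

```shell
# Push the Spark install and /etc/profile to each worker node
for host in h122.wzk.icu h123.wzk.icu; do
  rsync -av /opt/servers/spark-2.4.5 "root@${host}:/opt/servers/"
  rsync -av /etc/profile "root@${host}:/etc/profile"
done
```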

Start Cluster

Execute on Master node (h121):

# Start all nodes (Master + all Workers)
$SPARK_HOME/sbin/start-all.sh

# Verify processes
jps
# h121: Master, Worker (h121 is also listed in slaves, so it runs a Worker too)
# h122/h123: Worker

Open the Spark Web UI at http://h121.wzk.icu:8080 to see the registered Worker nodes and the available resources.
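
Besides the browser, the standalone Master serves its state as JSON at /json on the same port, which is handy for scripted checks (the aliveworkers field name is taken from the standalone Master's status response; verify it against your Spark version):

```shell
# Expect aliveworkers to equal the number of registered Workers
curl -s http://h121.wzk.icu:8080/json | grep -o '"aliveworkers" *: *[0-9]*'
```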

Verify: Run Example Program

Run the built-in Pi estimation example to verify that the cluster is working:

spark-submit \
  --master spark://h121.wzk.icu:7077 \
  --class org.apache.spark.examples.SparkPi \
  $SPARK_HOME/examples/jars/spark-examples_2.12-2.4.5.jar \
  100

When the job finishes, the console prints Pi is roughly 3.14..., indicating the cluster deployment succeeded.
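
The result line is easy to miss among the INFO logs; piping the output through grep isolates it:

```shell
# Same submission as above, keeping only the result line
spark-submit \
  --master spark://h121.wzk.icu:7077 \
  --class org.apache.spark.examples.SparkPi \
  $SPARK_HOME/examples/jars/spark-examples_2.12-2.4.5.jar \
  100 2>&1 | grep "Pi is roughly"
```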

Common Issues

  • Worker cannot register to Master: check whether the firewall opens port 7077; confirm SPARK_MASTER_HOST in spark-env.sh is correct
  • Event log write failure: confirm the HDFS /spark-logs directory has been created and is writable
  • Serialization exception: confirm the KryoSerializer config is correct and custom classes are registered with Kryo
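
For the firewall item, a sketch assuming firewalld (common on CentOS; adapt the commands to your distribution):

```shell
# Open the Master RPC port and the Web UI port on each node, then reload rules
firewall-cmd --permanent --add-port=7077/tcp
firewall-cmd --permanent --add-port=8080/tcp
firewall-cmd --reload
```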

With the Spark Standalone cluster in place, the next step is learning RDDs, Spark's core data abstraction, and their operations.