This is article 68 in the Big Data series: a step-by-step setup of an Apache Spark distributed computing environment, deployed on top of the existing three-node Hadoop cluster.
Environment Prerequisites
This article assumes the previous articles in the series have been completed:
- Three Linux nodes (h121, h122, h123) with passwordless SSH configured
- A Hadoop 3.x cluster running (HDFS + YARN)
- JDK 8 installed, with the JAVA_HOME environment variable configured
Download Spark
Download a Spark build matching your Hadoop version from the Apache archive (choose the without-hadoop build to avoid JAR conflicts):
# Download Spark 2.4.5 (without-hadoop version)
wget https://archive.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-without-hadoop-scala-2.12.tgz
# Extract to unified installation directory
tar -zxvf spark-2.4.5-bin-without-hadoop-scala-2.12.tgz -C /opt/servers/
cd /opt/servers/
mv spark-2.4.5-bin-without-hadoop-scala-2.12 spark-2.4.5
Configure Environment Variables
Append the Spark environment variables to /etc/profile (do this on all three nodes):
export SPARK_HOME=/opt/servers/spark-2.4.5
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
Run source /etc/profile to apply the changes.
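After sourcing, a quick sanity check (assuming the install path above) confirms the shell resolves Spark on each node:

```shell
# Both should succeed once /etc/profile has been sourced
echo $SPARK_HOME          # expect /opt/servers/spark-2.4.5
spark-submit --version    # prints the Spark version banner
```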
Configure Core Files
Enter the $SPARK_HOME/conf directory, copy the template files, and edit them:
slaves — Worker Node List
cp slaves.template slaves
Edit slaves and list the hostnames of all Worker nodes (note: in Spark 3.x this file was renamed to workers):
h121.wzk.icu
h122.wzk.icu
h123.wzk.icu
spark-env.sh — Environment Parameters
cp spark-env.sh.template spark-env.sh
Append at the end of the file:
export JAVA_HOME=/opt/servers/jdk1.8
export HADOOP_HOME=/opt/servers/hadoop-3.1.3
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_HOME=/opt/servers/spark-2.4.5
export SPARK_MASTER_HOST=h121.wzk.icu
export SPARK_MASTER_PORT=7077
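Because this is the without-hadoop build, Spark ships no Hadoop JARs of its own; spark-env.sh must also tell it where to find the cluster's, otherwise the daemons fail at startup with Hadoop ClassNotFoundException errors. A sketch of the extra lines (the worker resource caps are optional, and the values are only examples):

```shell
# Point the without-hadoop build at the cluster's Hadoop JARs
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
# Optional: cap per-node resources (example values, tune to your machines)
export SPARK_WORKER_CORES=2
export SPARK_WORKER_MEMORY=2g
```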
spark-defaults.conf — Default Config
cp spark-defaults.conf.template spark-defaults.conf
Add the key parameters:
spark.master spark://h121.wzk.icu:7077
spark.eventLog.enabled true
spark.eventLog.dir hdfs://h121.wzk.icu:8020/spark-logs
spark.driver.memory 512m
spark.serializer org.apache.spark.serializer.KryoSerializer
Note: the event log directory must be created on HDFS first (the path must match spark.eventLog.dir):
hadoop fs -mkdir -p /spark-logs
Distribute Config to Other Nodes
Use rsync to synchronize the whole Spark installation (including the config above) to h122 and h123:
rsync -av /opt/servers/spark-2.4.5 root@h122.wzk.icu:/opt/servers/
rsync -av /opt/servers/spark-2.4.5 root@h123.wzk.icu:/opt/servers/
# Sync environment variables (or add them manually on each node),
# then run `source /etc/profile` on h122 and h123 as well
rsync -av /etc/profile root@h122.wzk.icu:/etc/profile
rsync -av /etc/profile root@h123.wzk.icu:/etc/profile
Start Cluster
Execute on Master node (h121):
# Start all nodes (Master + all Workers)
$SPARK_HOME/sbin/start-all.sh
# Verify processes
jps
# h121: Master
# h122/h123: Worker
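start-all.sh wraps two scripts that can also be run separately, which helps when a single daemon needs restarting (paths assume the install directory above; in Spark 3.x start-slaves.sh is named start-workers.sh):

```shell
# Start only the Master on the current node
$SPARK_HOME/sbin/start-master.sh
# Start a Worker on every host listed in conf/slaves
$SPARK_HOME/sbin/start-slaves.sh
# Stop the whole cluster
$SPARK_HOME/sbin/stop-all.sh
```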
Open the Spark Web UI at http://h121.wzk.icu:8080 to see the registered Worker nodes and the available resources.
Verify: Run Example Program
Use the built-in Pi estimation example to verify the cluster is working:
spark-submit \
--master spark://h121.wzk.icu:7077 \
--class org.apache.spark.examples.SparkPi \
$SPARK_HOME/examples/jars/spark-examples_2.12-2.4.5.jar \
100
After execution, the console prints Pi is roughly 3.14..., indicating the cluster was deployed successfully.
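With event logging already pointed at hdfs://h121.wzk.icu:8020/spark-logs, a history server can replay finished applications after their driver UIs disappear. A minimal sketch, assuming the same HDFS path:

```shell
# spark-defaults.conf: tell the history server where completed-application
# logs live (must match spark.eventLog.dir configured above)
#   spark.history.fs.logDirectory  hdfs://h121.wzk.icu:8020/spark-logs
$SPARK_HOME/sbin/start-history-server.sh
# The history UI defaults to port 18080: http://h121.wzk.icu:18080
```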
Common Issues
| Issue | What to check |
|---|---|
| Worker fails to register with the Master | Firewall allows port 7077; SPARK_MASTER_HOST in spark-env.sh is correct |
| Event log write failure | The HDFS /spark-logs directory exists and is writable |
| Serialization exception | KryoSerializer is configured correctly and custom classes are registered |
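On the serialization row above: KryoSerializer works without registration, but registering application classes avoids writing full class names into every serialized record. A sketch for spark-defaults.conf, where com.example.MyEvent is a hypothetical application class:

```
spark.kryo.classesToRegister  com.example.MyEvent
# Optionally fail fast whenever an unregistered class is serialized:
spark.kryo.registrationRequired  true
```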
With the Spark Standalone cluster set up, the next step is learning RDDs, Spark's core data abstraction, and their operations.