This is article 2 in the Big Data series. It continues from the environment setup and explains what each Hadoop cluster XML configuration file means and how to write it.

Hadoop Cluster Configuration = HDFS + MapReduce + YARN

Hadoop cluster configuration consists of three parts:

  • HDFS Cluster Configuration: core-site.xml, hdfs-site.xml
  • MapReduce Cluster Configuration: mapred-site.xml
  • YARN Cluster Configuration: yarn-site.xml

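Three-node plan: h121 runs the NameNode, and h123 runs the SecondaryNameNode and the ResourceManager. All three nodes (h121, h122, h123) are listed in the slaves file, so each of them also runs a DataNode.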

1. hadoop-env.sh

Set JAVA_HOME explicitly so that the Hadoop daemons can find the JDK at runtime:

export JAVA_HOME=/opt/jdk/jdk1.8.0_202
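
Before starting any daemon, it can help to confirm that the configured path really contains a JDK. The following is a minimal sketch for a POSIX shell; the helper name check_java_home is hypothetical and not part of Hadoop.

```shell
# Hypothetical helper: succeeds only if the given directory contains an
# executable bin/java, which is what JAVA_HOME in hadoop-env.sh should point to.
check_java_home() {
  dir="$1"
  if [ -x "$dir/bin/java" ]; then
    echo "OK: $dir/bin/java found"
  else
    echo "ERROR: no executable java under $dir/bin" >&2
    return 1
  fi
}

# Usage (path taken from the hadoop-env.sh line above):
# check_java_home /opt/jdk/jdk1.8.0_202
```

Running the check on every node is a cheap way to catch a JDK that was installed under a different path on one machine.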

2. core-site.xml (NameNode Address)

<configuration>
  <!-- HDFS Access Address -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://h121.wzk.icu:9000</value>
  </property>
  <!-- Temporary Data Directory -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/servers/hadoop-2.9.2/data/tmp</value>
  </property>
</configuration>

3. hdfs-site.xml (SecondaryNameNode + Replication Factor)

<configuration>
  <!-- SecondaryNameNode Address -->
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>h123.wzk.icu:50090</value>
  </property>
  <!-- Replication factor, set to 3 for 3-node cluster -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
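
Once core-site.xml and hdfs-site.xml are in place, the values Hadoop actually resolves can be read back with hdfs getconf -confKey. The sketch below assumes the hdfs command is on PATH on the node where it runs; the helper name print_hdfs_conf is hypothetical.

```shell
# Hypothetical helper: print the effective value of each given key using
# `hdfs getconf -confKey`, which reads the core-site.xml and hdfs-site.xml
# found in the active Hadoop configuration directory.
print_hdfs_conf() {
  for key in "$@"; do
    printf '%s = %s\n' "$key" "$(hdfs getconf -confKey "$key")"
  done
}

# Usage on a node where the hdfs command is available:
# print_hdfs_conf fs.defaultFS dfs.replication dfs.namenode.secondary.http-address
```

If a printed value differs from what was written in the XML file, the usual suspects are a typo in the property name or a file edited in a directory that Hadoop does not read.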

Also list all three nodes in the slaves file, one hostname per line, so that a DataNode is started on each of them:

h121.wzk.icu
h122.wzk.icu
h123.wzk.icu
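
The slaves file can also be written from the shell, which avoids stray blank lines or a leftover default entry. This is a minimal sketch; the helper name write_slaves is hypothetical, and the etc/hadoop path in the usage line assumes the default Hadoop 2.x layout under the install directory used in this article.

```shell
# Hypothetical helper: write the worker list into <conf_dir>/slaves,
# one hostname per line, replacing any previous content of the file.
write_slaves() {
  conf_dir="$1"
  cat > "$conf_dir/slaves" <<'EOF'
h121.wzk.icu
h122.wzk.icu
h123.wzk.icu
EOF
}

# Usage, assuming the default configuration directory of this install:
# write_slaves /opt/servers/hadoop-2.9.2/etc/hadoop
```

Because the file is overwritten rather than appended to, running the helper twice leaves exactly three entries.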

4. mapred-site.xml (MapReduce Runtime Framework)

First copy the template file:

cp mapred-site.xml.template mapred-site.xml

Configure MapReduce to use YARN as the resource scheduling framework:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

5. yarn-site.xml (ResourceManager Address)

<configuration>
  <!-- ResourceManager Host -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>h123.wzk.icu</value>
  </property>
  <!-- MapReduce Shuffle Service -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>

Notes

  • When deploying on public cloud servers, each node's own hostname in /etc/hosts must map to 0.0.0.0 rather than to a loopback address (127.x.x.x); otherwise the services listen only locally and cannot be reached from outside
  • All configuration files must be identical on the three nodes; synchronize them with an rsync distribution script
  • Give the same user ownership of the install directory on every node: chown -R hadoop:hadoop /opt/servers/hadoop-2.9.2
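
The synchronization step in the notes above can be sketched as a small loop around rsync. This is only an illustration, not the distribution script from article 3. It assumes that rsync is installed on every node, that SSH access to the other nodes already works, and that the remote user is hadoop (inferred from the chown command). The helper name sync_hadoop_conf and the DRY_RUN switch are hypothetical.

```shell
# Hypothetical helper: copy the local Hadoop config directory to each given host.
# With DRY_RUN=1 it only prints the rsync commands instead of running them.
sync_hadoop_conf() {
  conf_dir="$1"
  shift
  for host in "$@"; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "rsync -av $conf_dir/ hadoop@$host:$conf_dir/"
    else
      rsync -av "$conf_dir/" "hadoop@$host:$conf_dir/"
    fi
  done
}

# Usage from h121, previewing the commands first:
# DRY_RUN=1 sync_hadoop_conf /opt/servers/hadoop-2.9.2/etc/hadoop h122.wzk.icu h123.wzk.icu
```

The trailing slash on the source path makes rsync copy the contents of the directory rather than nesting a second etc/hadoop directory inside the target.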

Once the configuration is in place, continue to article 3: SSH Passwordless Login and Distribution Script.