This is article 5 in the Big Data series. In it we run the first truly distributed MapReduce job, WordCount, on the completed three-node cluster.
Complete illustrated version: CSDN Original | Juejin
HDFS Architecture Review
HDFS uses Master/Slave architecture:
- NameNode: Manages file system namespace, records mapping from Blocks to DataNodes
- DataNode: Stores actual data Blocks, reports to NameNode via heartbeat
- Client: Splits files into Blocks when uploading, gets locations from NameNode, interacts directly with DataNodes
HDFS design principles: high fault tolerance (3 replicas by default), high throughput (sequential reads and writes), well suited to batch processing of large files, poorly suited to low-latency random reads and writes.
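You can observe these principles directly on a running cluster. A hedged sketch, assuming the three-node cluster from the earlier articles is up and the file path from the practice steps below (`/test/input/test.txt` is an example path):

```shell
# Show block placement and replica locations for a file in HDFS
# (requires a running cluster; the path is an example)
hdfs fsck /test/input/test.txt -files -blocks -locations

# Show the configured default replication factor
hdfs getconf -confKey dfs.replication
```

With the default configuration, `fsck` should report each block on multiple DataNodes, matching the 3-replica fault-tolerance design described above.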
Practice Steps
1. Prepare Test File
# Create test text locally
echo "hello hadoop hello world hadoop" > /opt/wzk/test.txt
2. Create HDFS Directory and Upload
hdfs dfs -mkdir -p /test/input
hdfs dfs -put /opt/wzk/test.txt /test/input
hdfs dfs -ls /test/input
3. Submit WordCount Job
Hadoop ships with a WordCount example jar:
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar \
wordcount /test/input /wcoutput
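One common pitfall: MapReduce refuses to start if the output directory already exists, failing with a `FileAlreadyExistsException`. When re-running the job, remove the previous output first (a hedged sketch; `/wcoutput` matches the path used above):

```shell
# MapReduce will not overwrite an existing output directory;
# delete any previous run's output before resubmitting
# (-f: do not report an error if the path is absent)
hdfs dfs -rm -r -f /wcoutput
```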
4. View Job via YARN UI
Open http://h123.wzk.icu:8088/cluster/apps; the wordcount job appears in the RUNNING state and changes to SUCCEEDED once it completes.
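The same information is available from the command line, which is handy when the web UI is unreachable. A sketch, assuming the YARN client is installed on the node:

```shell
# List all applications with their final state
# (a CLI alternative to the YARN web UI)
yarn application -list -appStates ALL
```

The output includes the application ID, name (`word count`), state, and tracking URL for each job.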
5. View Computing Results
# List output directory
hdfs dfs -ls /wcoutput
# View results
hdfs dfs -cat /wcoutput/part-r-00000
Expected output:
hadoop 2
hello 2
world 1
6. Download Results to Local
hdfs dfs -get /wcoutput/part-r-00000 /opt/wzk/result.txt
MapReduce Working Principle Summary
WordCount goes through three stages:
- Map: Tokenize each line of text, emitting (word, 1) key-value pairs
- Shuffle: Route all pairs with the same key to the same Reducer
- Reduce: Sum the counts for each word and emit the final result
In the three-node cluster, Map tasks are distributed across multiple DataNodes for parallel execution, demonstrating true distributed computing.
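On a single machine, the same three stages can be mimicked with a standard shell pipeline. This is purely illustrative (no Hadoop involved): `tr` plays Map, `sort` plays Shuffle, and `uniq -c` plays Reduce:

```shell
# A single-machine analogy for WordCount (illustration only, not Hadoop):
#   Map:     split each line into one word per line
#   Shuffle: sort so identical words become adjacent
#   Reduce:  count each run of identical words
echo "hello hadoop hello world hadoop" \
  | tr ' ' '\n' \
  | sort \
  | uniq -c
```

This prints `2 hadoop`, `2 hello`, `1 world` (with leading padding from `uniq -c`), matching the expected job output above.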
Next article: Big Data 06 - JobHistory Server Configuration