This is article 5 in the Big Data series. In it we run the first truly distributed MapReduce job, WordCount, on the completed three-node cluster.
Complete illustrated version: CSDN Original | Juejin
HDFS Architecture Review
HDFS uses Master/Slave architecture:
- NameNode: Manages file system namespace, records mapping from Blocks to DataNodes
- DataNode: Stores actual data Blocks, reports to NameNode via heartbeat
- Client: Splits files into Blocks when uploading, gets locations from NameNode, interacts directly with DataNodes
HDFS design principles: high fault tolerance (3 replicas by default), high throughput (sequential reads and writes), well suited to batch processing of large files, poorly suited to low-latency random reads and writes.
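You can observe these principles directly on a running cluster. A hedged sketch, assuming the three-node cluster from the earlier articles is up and the file path from the practice steps below (`/test/input/test.txt` is an example path):

```shell
# Show block placement and replica locations for a file in HDFS
# (requires a running cluster; the path is an example)
hdfs fsck /test/input/test.txt -files -blocks -locations

# Show the configured default replication factor
hdfs getconf -confKey dfs.replication
```

With the default configuration, `fsck` should report each block on multiple DataNodes, matching the 3-replica fault-tolerance design described above.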
Practice Steps
1. Prepare Test File
# Create test text locally
echo "hello hadoop hello world hadoop" > /opt/wzk/test.txt
2. Create HDFS Directory and Upload
hdfs dfs -mkdir -p /test/input
hdfs dfs -put /opt/wzk/test.txt /test/input
hdfs dfs -ls /test/input
3. Submit WordCount Job
Hadoop ships with a WordCount example jar:
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar \
wordcount /test/input /wcoutput
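One common pitfall: MapReduce refuses to start if the output directory already exists, failing with a `FileAlreadyExistsException`. When re-running the job, remove the previous output first (a hedged sketch; `/wcoutput` matches the path used above):

```shell
# MapReduce will not overwrite an existing output directory;
# delete any previous run's output before resubmitting
# (-f: do not report an error if the path is absent)
hdfs dfs -rm -r -f /wcoutput
```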
4. View Job via YARN UI
Open http://h123.wzk.icu:8088/cluster/apps; the wordcount job appears in the RUNNING state and changes to SUCCEEDED once it completes.
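The same information is available from the command line, which is handy when the web UI is unreachable. A sketch, assuming the YARN client is installed on the node:

```shell
# List all applications with their final state
# (a CLI alternative to the YARN web UI)
yarn application -list -appStates ALL
```

The output includes the application ID, name (`word count`), state, and tracking URL for each job.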
5. View Computing Results
# List output directory
hdfs dfs -ls /wcoutput
# View results
hdfs dfs -cat /wcoutput/part-r-00000
Expected output:
hadoop 2
hello 2
world 1
6. Download Results to Local
hdfs dfs -get /wcoutput/part-r-00000 /opt/wzk/result.txt
MapReduce Working Principle Summary
WordCount goes through three stages:
- Map: Tokenize each line of text, emitting (word, 1) key-value pairs
- Shuffle: Route all pairs with the same key to the same Reducer
- Reduce: Sum the counts for each word and emit the final result
In the three-node cluster, Map tasks are distributed across multiple DataNodes for parallel execution, demonstrating true distributed computing.
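On a single machine, the same three stages can be mimicked with a standard shell pipeline. This is purely illustrative (no Hadoop involved): `tr` plays Map, `sort` plays Shuffle, and `uniq -c` plays Reduce:

```shell
# A single-machine analogy for WordCount (illustration only, not Hadoop):
#   Map:     split each line into one word per line
#   Shuffle: sort so identical words become adjacent
#   Reduce:  count each run of identical words
echo "hello hadoop hello world hadoop" \
  | tr ' ' '\n' \
  | sort \
  | uniq -c
```

This prints `2 hadoop`, `2 hello`, `1 world` (with leading padding from `uniq -c`), matching the expected job output above.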
Next article: Big Data 06 - JobHistory Server Configuration