Gleam Lab · Tag Archive

Tag: Hadoop

17 articles collected by topic for tutorials, cases, engineering practice, and research notes.

AI Investigation #51: Big Data Technology Evolution - Obsolete Frameworks, Architectures and the Reasons Behind Them

Big data technology evolution: MapReduce replaced by Spark, Storm replaced by Flink, Pig/Hive gradually phased out.

8/14/2025

AI Investigation #50: Big Data Evolution - Two Decades of Architectural Change from Hadoop to Flink

Two decades of big data evolution: from 2006 MapReduce batch processing to 2013 Spark in-memory computing, to 2019 Flink real-time computing.

8/13/2025

AI Research 49 - Big Data Survey Report: Development History from 1997 to 2025

Big data development began in 1997 when NASA proposed the concept, 2003-2006 Google published GFS, MapReduce, Bigtable three major papers leading distributed computing re...

8/12/2025

Big Data 140 - ClickHouse CollapsingMergeTree & External Data Sources

ClickHouse external data source engine guide: DDL templates, key parameters and read/write pipelines for ENGINE=HDFS, ENGINE=MySQL, ENGINE=Kafka, and distributed table co...

9/19/2024

Sqoop Practice: MySQL Full Data Import to HDFS

Complete example demonstrating Sqoop importing MySQL table data to HDFS, covering core parameter explanations, MapReduce parallel mechanism, and execution result verifica...

7/20/2024

MapReduce JOIN Four Implementation Strategies

This is article 11 in the Big Data series. Introduces four classic strategies for implementing multi-table JOIN in MapReduce framework and their Java implementations.

7/4/2024

Hive Introduction: Architecture and Cluster Installation

Introduction to Hive data warehouse core concepts, architecture components and pros/cons, with detailed steps to install and configure Hive 2.3.9 on three-node Hadoop clu...

7/4/2024

HDFS Java Client Practice: Upload/Download Files, Directory Operations and API Usage

This is article 9 in the Big Data series. Learn to operate HDFS through Java code, master Hadoop's Java Client API.

7/3/2024

Java Implementation MapReduce WordCount Complete Code

Implement Hadoop MapReduce WordCount from scratch: Hadoop serialization mechanism detailed explanation, writing Mapper, Reducer, Driver three components...

7/3/2024

HDFS Distributed File System Read/Write Principle

Deep dive into HDFS architecture: NameNode, DataNode, Client roles, Block storage mechanism, file read/write process (Pipeline write and nearest read), and HDFS basic com...

7/2/2024

HDFS CLI Practice Complete Command Guide

Complete HDFS CLI practice: hadoop fs common commands including directory operations, file upload/download, permission management, with three-node cluster live demo.

7/2/2024

Hadoop Cluster WordCount Distributed Computing Practice

Complete WordCount execution on Hadoop cluster: upload files to HDFS, submit MapReduce job, view running status through YARN UI, verify true distributed computing.

7/1/2024

Hadoop JobHistoryServer Configuration and Log Aggregation

Configure Hadoop JobHistoryServer to record MapReduce job execution history, enable YARN log aggregation, view job details and logs via Web UI.

7/1/2024

Hadoop Cluster SSH Passwordless Login Configuration and Distribution Script

Complete guide for Hadoop three-node cluster SSH passwordless login: generate RSA keys, distribute public keys, write rsync cluster distribution script.

6/30/2024

Hadoop Cluster Startup and Web UI Verification

Complete startup process for Hadoop three-node cluster: format NameNode, start HDFS and YARN, verify cluster status via Web UI, including start-dfs.sh and start-yarn.

6/30/2024

Basic Environment Setup: Hadoop Cluster

This article is migrated from Juejin. Original link: Big Data 01 - Basic Environment Setup

6/28/2024

Hadoop Cluster XML Configuration Details

Detailed explanation of Hadoop cluster three-node XML configuration files: core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.

6/28/2024