Blog
Technical exploration and thoughts · 655 articles
Sqoop Incremental Import and CDC Change Data Capture Prin...
Introduces Sqoop's --incremental append import mechanism and explains CDC (Change Data Capture) core concepts, a comparison of capture methods, and modern solutions like Flink CDC, Deb...
ZooKeeper Distributed Coordination Framework Introduction...
Introduction to ZooKeeper core concepts, the Leader/Follower/Observer roles, and ZAB protocol principles, with a walkthrough of installing and configuring a 3-node cluster.
Sqoop Partial Import: --query, --columns, --where Three F...
Detailed explanation of three ways Sqoop imports a subset of MySQL data into HDFS: a custom query (--query), selected columns (--columns), and WHERE-clause filtering (--where), with applicable scenarios and precautions.
Sqoop and Hive Integration: MySQL ↔ Hive Bidirectional Da...
Demonstrates importing MySQL data directly into a Hive table with Sqoop and exporting Hive data back to MySQL, covering key parameters such as --hive-import and --create-hive-table.
Sqoop Data Migration ETL Tool Introduction and Installation
Introduction to Apache Sqoop's core principles, use cases, and installation and configuration steps on a Hadoop cluster, to help you get started quickly with batch data migration between MySQL and HDFS/Hive.
Sqoop Practice: MySQL Full Data Import to HDFS
Complete example of importing MySQL table data into HDFS with Sqoop, covering the core parameters, the MapReduce parallelism mechanism, and verification of the execution results.
Flume Collect Hive Logs to HDFS
Uses a Flume exec source to tail Hive log files in real time, buffers events in a memory channel, and configures an HDFS sink with time-based partitioning so that log data lands in HDFS automatically.
Flume Dual Sink: Write Logs to Both HDFS and Local File
Uses Flume's replicating mode (Replicating Channel Selector) and a three-Agent cascade architecture to write the same log data to both HDFS and a local file, meeting both offline analysis and re...
Apache Flume Architecture and Core Concepts
Introduction to Apache Flume's positioning, core components (Source, Channel, Sink), event model, and common data flow topologies, along with installation and configuration.
Flume Hello World: NetCat Source + Memory Channel + Logge...
Walks through Flume's simplest Hello World case: a netcat source listens on a port, a memory channel buffers events, and a logger sink prints them to the console, demonstrating the complete Source→Channel→Sink data flow.
Hive Metastore Three Modes and Remote Deployment
Detailed explanation of Hive Metastore's embedded, local, and remote deployment modes, with complete steps to configure a high-availability remote Metastore on a three-node cluster.
HiveServer2 Configuration and Beeline Remote Connection
Introduction to HiveServer2's architecture and role, configuring the Hadoop proxy user and WebHDFS, and setting up cross-node remote JDBC access to Hive via the Beeline client.
Hive DDL and DML Operations
Systematic explanation of Hive DDL (database/table creation, internal and external tables) and DML (data loading, insertion, query) operations, with complete HiveQL examples and configuration optim...
Hive HQL Advanced: Data Import/Export and Query Practice
Deep dive into Hive's multiple data import methods (LOAD/INSERT/External Table/Sqoop), data export methods, and practical usage of HQL query operations like aggregation, filtering, and sorting.
MapReduce JOIN Four Implementation Strategies
Deep dive into four JOIN strategies in MapReduce (Reduce-Side Join, Map-Side Join, Semi-Join, and Bloom Join), covering their principles and Java implementations, with analysis of applicable scenarios and performan...
Hive Introduction: Architecture and Cluster Installation
Introduction to the Hive data warehouse: core concepts, architecture components, and pros and cons, with detailed steps to install and configure Hive 2.3.9 on a three-node Hadoop cluster.
HDFS Java Client Practice: Upload/Download Files, Directo...
File operations with the Hadoop HDFS Java Client API: Maven dependency configuration, the core FileSystem/Path/Configuration classes, and implementing file upload, download, delete, directory listing, and progress ba...
Java Implementation MapReduce WordCount Complete Code
Implements Hadoop MapReduce WordCount from scratch: a detailed explanation of Hadoop's serialization mechanism, writing the Mapper, Reducer, and Driver components, Maven project configuration, and local and clus...
HDFS Distributed File System Read/Write Principle
Deep dive into HDFS architecture: the NameNode, DataNode, and Client roles, the Block storage mechanism, the file read/write process (pipeline write and nearest-replica read), and basic HDFS commands.
HDFS CLI Practice Complete Command Guide
Complete HDFS CLI practice: common hadoop fs commands, including directory operations, file upload/download, and permission management, with a live demo on a three-node cluster.