This is article 21 in the Big Data series. It introduces the principles, installation, and configuration of the Sqoop data-migration (ETL) tool.
Complete illustrated version: CSDN Original | Juejin
What is Sqoop
Apache Sqoop (SQL to Hadoop) is a batch ETL tool designed for efficiently transferring data between relational databases (MySQL, Oracle, etc.) and the Hadoop ecosystem (HDFS, Hive, HBase).
Note: Apache Sqoop was moved to the Apache Attic in June 2021 and is no longer actively maintained. It remains valuable for learning ETL concepts and for integrating with legacy systems; new projects should evaluate alternatives such as Flink CDC and DataX.
Core capabilities:
- Import relational database data to HDFS / Hive / HBase
- Export Hadoop analysis results back to RDBMS
- Automatically converts transfer jobs into MapReduce programs, supporting parallel transfer with multiple mappers
How It Works
After receiving an import or export command, Sqoop reads the database's metadata and splits the work across multiple map tasks that run in parallel. Each mapper connects to the database via JDBC and pulls one slice of the data. Sqoop jobs are map-only (no reduce phase), so each mapper writes its slice directly to HDFS or Hive.
MySQL → Sqoop → MapReduce Job (Multiple Mappers) → HDFS / Hive
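The splitting step above can be sketched in a few lines of shell. This is a simplified illustration, not Sqoop's actual source: it assumes an integer split column named `id` whose min and max Sqoop has already fetched (Sqoop does this with a `SELECT MIN(id), MAX(id)` bounding query), and divides the range evenly across the mappers:

```shell
# Simplified sketch of Sqoop's range splitting (assumes integer split column "id")
MIN=1; MAX=100; MAPPERS=4
STEP=$(( (MAX - MIN + 1) / MAPPERS ))
for i in $(seq 0 $((MAPPERS - 1))); do
  LO=$(( MIN + i * STEP ))
  HI=$(( LO + STEP - 1 ))
  # the last mapper picks up any remainder from integer division
  [ "$i" -eq $((MAPPERS - 1)) ] && HI=$MAX
  echo "mapper $i: WHERE id >= $LO AND id <= $HI"
done
```

Each mapper then issues its own JDBC query restricted to its range, which is why a sensible `--split-by` column (evenly distributed, indexed) matters for import performance.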
ETL Three Stages
| Stage | Description |
|---|---|
| Extract | Pull raw data via JDBC/API/filesystem |
| Transform | Data cleaning, format standardization, business rules application |
| Load | Write to HDFS, Hive, or other target systems |
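To make the three stages concrete, a minimal round trip with the tools from this series might look like the sketch below. The hostnames, table names, and the Hive query are illustrative placeholders, not commands from the original article:

```shell
# Extract: pull the raw table from MySQL into a staging directory on HDFS
sqoop import \
  --connect jdbc:mysql://h122.wzk.icu:3306/mydb \
  --username root -P \
  --table orders \
  --target-dir /staging/orders

# Transform: clean and aggregate in Hive (query is illustrative)
hive -e "INSERT OVERWRITE TABLE daily_sales
         SELECT order_date, SUM(amount) FROM orders GROUP BY order_date"

# Load: push the aggregated result back to MySQL for reporting
sqoop export \
  --connect jdbc:mysql://h122.wzk.icu:3306/mydb \
  --username root -P \
  --table daily_sales \
  --export-dir /user/hive/warehouse/daily_sales
```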
Installation Configuration
1. Download and Extract
Use final stable version 1.4.7:
tar zxvf sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
mv sqoop-1.4.7.bin__hadoop-2.6.0/ /opt/servers/sqoop-1.4.7/
2. Configure Environment Variables (append to /etc/profile or ~/.bashrc, then source the file)
export SQOOP_HOME=/opt/servers/sqoop-1.4.7
export PATH=$PATH:$SQOOP_HOME/bin
3. Configure sqoop-env.sh
export HADOOP_COMMON_HOME=/opt/servers/hadoop-2.9.2
export HADOOP_MAPRED_HOME=/opt/servers/hadoop-2.9.2
export HIVE_HOME=/opt/servers/hive-2.3.9
4. Copy Dependent JARs
# MySQL JDBC driver
cp $HIVE_HOME/lib/mysql-connector-java-8.0.19.jar $SQOOP_HOME/lib
# Hive related libraries
cp $HIVE_HOME/lib/hive-*-2.3.9.jar $SQOOP_HOME/lib/
# JSON utility library
cp $HADOOP_HOME/share/hadoop/tools/lib/json-20170516.jar $SQOOP_HOME/lib
5. Verify Installation
sqoop version
If the version information prints normally, the installation succeeded.
Quick Examples
MySQL → HDFS Import:
sqoop import \
--connect jdbc:mysql://h122.wzk.icu:3306/mydb \
--username root \
--password 123456 \
--table employees \
--target-dir /user/hadoop/employees \
--num-mappers 4
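For recurring imports, Sqoop also supports incremental mode, pulling only rows added since the last run. A sketch (the table and column names follow the example above; `--last-value` would normally come from the previous run's output):

```shell
# Incremental append: import only rows with id greater than --last-value
sqoop import \
  --connect jdbc:mysql://h122.wzk.icu:3306/mydb \
  --username root -P \
  --table employees \
  --target-dir /user/hadoop/employees \
  --incremental append \
  --check-column id \
  --last-value 0
```

Note that `-P` prompts for the password interactively instead of exposing it in shell history as `--password 123456` does.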
HDFS → MySQL Export:
sqoop export \
--connect jdbc:mysql://h122.wzk.icu:3306/mydb \
--username root \
--password 123456 \
--table employees \
--export-dir /user/hive/warehouse/employees
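Sqoop can also land imports directly in a Hive table instead of a raw HDFS directory. A hedged sketch (the Hive database and table names are illustrative; `--password-file` reads the password from a protected file on HDFS rather than the command line):

```shell
# MySQL → Hive import: Sqoop creates/loads the Hive table after the HDFS copy
sqoop import \
  --connect jdbc:mysql://h122.wzk.icu:3306/mydb \
  --username root \
  --password-file /user/hadoop/.mysql.pwd \
  --table employees \
  --hive-import \
  --hive-table mydb.employees
```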
Modern Alternatives
Since Sqoop is no longer maintained, new projects should consider:
- Apache Flink CDC — Real-time incremental sync based on binlog
- DataX — Alibaba open source, supports rich data sources
- Apache NiFi — Visual data flow orchestration
- Debezium — CDC solution in Kafka Connect ecosystem
- SeaTunnel — Modern data integration framework with batch and streaming
The next article will walk through the full process of importing MySQL data into HDFS with Sqoop.