This is article 21 in the Big Data series. It introduces the principles, installation, and configuration of Sqoop, a data migration ETL tool.

Complete illustrated version: CSDN Original | Juejin

What is Sqoop

Apache Sqoop (SQL to Hadoop) is a batch ETL tool designed for efficiently transferring data between relational databases (MySQL, Oracle, etc.) and the Hadoop ecosystem (HDFS, Hive, HBase).

Note: Apache Sqoop was moved to the Apache Attic in June 2021 and is no longer actively maintained. It remains valuable for learning ETL concepts and for legacy system integration; new projects should evaluate alternatives such as Flink CDC and DataX.

Core capabilities:

  • Import relational database data to HDFS / Hive / HBase
  • Export Hadoop analysis results back to RDBMS
  • Automatically converts jobs into MapReduce programs, supporting parallel transfer with multiple Mappers

How It Works

After receiving an import or export command, Sqoop reads the database metadata and splits the work across multiple Mappers for parallel execution. Each Mapper connects to the database via JDBC and pulls one slice of the data; the results are finally aggregated and written to HDFS or Hive.

MySQL  →  Sqoop  →  MapReduce Job (Multiple Mappers)  →  HDFS / Hive
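Conceptually, the splitting works like range partitioning on a numeric split column. The sketch below is illustrative only (the column name id and the min/max values are assumptions); real Sqoop determines the boundaries with a SELECT MIN(col), MAX(col) query and then assigns each Mapper one range:

```shell
# Toy illustration (not Sqoop itself): dividing id range 1..1000000 among 4 Mappers
min=1; max=1000000; mappers=4
size=$(( (max - min + 1) / mappers ))          # rows per Mapper slice
for i in 0 1 2 3; do
  lo=$(( min + i * size ))
  hi=$(( lo + size - 1 ))
  [ "$i" -eq $(( mappers - 1 )) ] && hi=$max   # last Mapper takes any remainder
  echo "Mapper $i: WHERE id >= $lo AND id <= $hi"
done
```

Each echoed WHERE clause corresponds to the predicate one Mapper would apply to its JDBC query, which is why an evenly distributed split column matters for balanced parallelism.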

ETL Three Stages

Stage      Description
Extract    Pull raw data via JDBC/API/filesystem
Transform  Data cleaning, format standardization, business rule application
Load       Write to HDFS, Hive, or other target systems
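The three stages can be sketched with ordinary shell tools (a toy illustration, not Sqoop: a local CSV stands in for the JDBC source, and a local directory stands in for HDFS):

```shell
# Extract: pull "raw" records from the source system
printf 'id,name\n1, alice \n2, BOB\n' > raw.csv

# Transform: clean the data (trim whitespace, normalize names to lowercase)
awk -F, 'NR==1 {print; next} {gsub(/^ +| +$/, "", $2); print $1 "," tolower($2)}' raw.csv > clean.csv

# Load: write the cleaned data to the target system
mkdir -p target && cp clean.csv target/
cat target/clean.csv
```

In a real Sqoop pipeline, Extract is the JDBC pull, Transform is typically minimal (type mapping, field delimiters), and heavier cleaning is done afterwards in Hive or MapReduce.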

Installation Configuration

1. Download and Extract

Use the final stable release, 1.4.7:

tar zxvf sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
mv sqoop-1.4.7.bin__hadoop-2.6.0/ /opt/servers/sqoop-1.4.7/

2. Configure Environment Variables

export SQOOP_HOME=/opt/servers/sqoop-1.4.7
export PATH=$PATH:$SQOOP_HOME/bin
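Assuming the exports above were appended to /etc/profile (adjust to ~/.bashrc or your own setup), reload it so the current shell picks up the new variables:

```shell
# reload the profile so SQOOP_HOME and the updated PATH take effect
source /etc/profile
```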

3. Configure sqoop-env.sh

export HADOOP_COMMON_HOME=/opt/servers/hadoop-2.9.2
export HADOOP_MAPRED_HOME=/opt/servers/hadoop-2.9.2
export HIVE_HOME=/opt/servers/hive-2.3.9

4. Copy Dependent JARs

# MySQL JDBC driver
cp $HIVE_HOME/lib/mysql-connector-java-8.0.19.jar $SQOOP_HOME/lib

# Hive related libraries
cp $HIVE_HOME/lib/hive-*-2.3.9.jar $SQOOP_HOME/lib/

# JSON utility library
cp $HADOOP_HOME/share/hadoop/tools/lib/json-20170516.jar $SQOOP_HOME/lib

5. Verify Installation

sqoop version

If the version information prints normally, the installation succeeded.

Quick Examples

MySQL → HDFS Import:

sqoop import \
  --connect jdbc:mysql://h122.wzk.icu:3306/mydb \
  --username root \
  --password 123456 \
  --table employees \
  --target-dir /user/hadoop/employees \
  --num-mappers 4

HDFS → MySQL Export:

sqoop export \
  --connect jdbc:mysql://h122.wzk.icu:3306/mydb \
  --username root \
  --password 123456 \
  --table employees \
  --export-dir /user/hive/warehouse/employees

Modern Alternatives

Sqoop is no longer maintained. New projects should consider:

  • Apache Flink CDC — Real-time incremental sync based on binlog
  • DataX — Alibaba open source, supports rich data sources
  • Apache NiFi — Visual data flow orchestration
  • Debezium — CDC solution in Kafka Connect ecosystem
  • SeaTunnel — Modern data integration framework with batch and streaming

The next article will demonstrate the full process of importing MySQL data into HDFS with Sqoop.