This is article 22 in the Big Data series. It walks through the complete workflow of importing a MySQL table into HDFS with Sqoop.


Experimental Environment

Three 2C4G public cloud servers (h121, h122, h123) form the Hadoop cluster, with the MySQL service running on h122. HDFS, YARN, and Hive are all up and running.

Prepare Test Data

Create a test table in MySQL and batch-insert test data:

CREATE DATABASE sqoop;
USE sqoop;

CREATE TABLE goodtbl (
  gname        VARCHAR(50),
  serialNumber INT PRIMARY KEY,  -- primary key, used later as the split column
  price        INT,
  stock_number INT,
  create_time  DATE
);

-- Use stored procedure to batch insert 100 test records
CALL batchInsertTestData(1, 100);
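The body of the stored procedure is not shown in the article. As a stand-in, here is a minimal Python sketch that generates equivalent batch INSERT statements; the column values (the product-NNN names, prices, dates) are made up purely for illustration:

```python
# Hypothetical stand-in for the batchInsertTestData stored procedure:
# generate test rows for goodtbl as parameterized INSERT statements.
from datetime import date, timedelta

def batch_insert_statements(start, count):
    stmts = []
    for i in range(start, start + count):
        row = ("product-%03d" % i,                 # gname
               i,                                  # serialNumber (1..N)
               i % 100 + 1,                        # price
               i * 10,                             # stock_number
               (date(2024, 1, 1) + timedelta(days=i % 30)).isoformat())
        stmts.append(
            "INSERT INTO goodtbl VALUES ('%s', %d, %d, %d, '%s');" % row)
    return stmts

stmts = batch_insert_statements(1, 100)
print(len(stmts))    # 100 statements, matching CALL batchInsertTestData(1, 100)
print(stmts[0])
```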

Full Import Command

sqoop import \
  --connect jdbc:mysql://h122.wzk.icu:3306/sqoop \
  --username hive \
  --password hive@wzk.icu \
  --table goodtbl \
  --target-dir /user/hadoop/goodtbl \
  --delete-target-dir \
  --num-mappers 4 \
  --fields-terminated-by "\t"

Key Parameter Explanations

Parameter               Description
--connect               JDBC connection string
--username / --password Database credentials
--table                 Source table name
--target-dir            HDFS target directory
--delete-target-dir     Delete the target directory first if it exists, to avoid conflicts
--num-mappers           Number of parallel Mappers (default 4)
--fields-terminated-by  Output field delimiter

MapReduce Parallel Mechanism

A Sqoop import is compiled into a map-only MapReduce job, and multiple Mappers pull data in parallel:

  1. Read the primary key range (min / max) of the source table
  2. Divide the range evenly into N intervals (N = --num-mappers)
  3. Each Mapper pulls its interval's slice of data via JDBC
  4. Each Mapper writes its output file to the HDFS target directory

For a primary key ranging over 1-100 with 4 Mappers, the splits are: 1-25, 26-50, 51-75, 76-100.
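The split arithmetic above can be sketched in a few lines of Python. This is a simplification of Sqoop's integer split logic, ignoring non-numeric split columns and skewed key distributions:

```python
# Sketch of Sqoop's default split logic: divide the [min, max] range of
# the split column into num_mappers near-equal, contiguous intervals.
def compute_splits(lo, hi, num_mappers):
    total = hi - lo + 1
    size, rem = divmod(total, num_mappers)
    splits, start = [], lo
    for i in range(num_mappers):
        # the first `rem` intervals absorb one extra key each
        end = start + size - 1 + (1 if i < rem else 0)
        splits.append((start, end))
        start = end + 1
    return splits

print(compute_splits(1, 100, 4))  # [(1, 25), (26, 50), (51, 75), (76, 100)]
```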

If the table has no primary key, or the primary key is unevenly distributed, use --split-by to specify the split column:

sqoop import \
  --connect jdbc:mysql://h122.wzk.icu:3306/sqoop \
  --username hive \
  --password hive@wzk.icu \
  --table goodtbl \
  --target-dir /user/hadoop/goodtbl \
  --split-by serialNumber \
  --num-mappers 4 \
  --fields-terminated-by "\t"

Custom Query Import

When you need a multi-table JOIN or complex filtering, use --query instead of --table:

sqoop import \
  --connect jdbc:mysql://h122.wzk.icu:3306/sqoop \
  --username hive \
  --password hive@wzk.icu \
  --query "SELECT gname, price, create_time FROM goodtbl WHERE \$CONDITIONS" \
  --target-dir /user/hadoop/goodtbl_filtered \
  --split-by serialNumber \
  --num-mappers 2 \
  --fields-terminated-by "\t"

Note: when using --query, the WHERE clause must include the $CONDITIONS placeholder (escaped as \$CONDITIONS inside double quotes); Sqoop automatically replaces it with each Mapper's split condition.
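A rough Python illustration of that substitution; the exact predicate text is a simplification of what Sqoop actually generates for each split:

```python
# Illustration of the $CONDITIONS mechanism: the free-form query is
# rewritten once per Mapper, substituting that Mapper's split predicate
# on the --split-by column for the placeholder.
query = "SELECT gname, price, create_time FROM goodtbl WHERE $CONDITIONS"

splits = [(1, 50), (51, 100)]        # two Mappers over serialNumber
per_mapper = [
    query.replace("$CONDITIONS",
                  "serialNumber >= %d AND serialNumber <= %d" % (lo, hi))
    for lo, hi in splits
]
for q in per_mapper:
    print(q)
```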

Verify Import Results

# View generated files on HDFS
hdfs dfs -ls /user/hadoop/goodtbl

# View data content (first 5 lines)
hdfs dfs -cat /user/hadoop/goodtbl/part-m-00000 | head -5

Typical output (tab-separated):

ProductA    1    88    100    2024-01-01
ProductB    2    66    200    2024-01-02
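To sanity-check the layout programmatically, a small Python sketch can map each tab-separated line back onto the goodtbl columns; the sample strings below stand in for real hdfs dfs -cat output:

```python
# Minimal parser for the tab-separated part-m-* output files,
# mapping each line back onto the goodtbl column names.
sample = ("ProductA\t1\t88\t100\t2024-01-01\n"
          "ProductB\t2\t66\t200\t2024-01-02")

columns = ("gname", "serialNumber", "price", "stock_number", "create_time")
rows = [dict(zip(columns, line.split("\t")))
        for line in sample.splitlines()]
print(rows[0]["gname"], rows[0]["price"])   # ProductA 88
```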

Common Issues

Q: Import fails with Target directory already exists
A: Add the --delete-target-dir parameter, or run hdfs dfs -rm -r /target/path beforehand.

Q: Should the Mapper count be 1 or more?
A: Use -m 1 for small tables to avoid unnecessary parallelism overhead; for large tables, scale up based on cluster resources, with 4-8 Mappers typically working well.

Q: Isn't a plaintext password a security risk?
A: Use the --password-file parameter to point at a file on HDFS that contains the password, avoiding plaintext exposure in shell history.
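Preparing such a password file can be sketched in Python; the filename is an example. Sqoop expects the file to contain only the password, with no trailing newline, and recommends owner-read-only permissions:

```python
# Sketch: create a local password file with restrictive permissions
# before uploading it to HDFS for use with --password-file.
import os

path = "sqoop.pwd"                    # example filename
with open(path, "w") as f:
    f.write("hive@wzk.icu")           # password only, no trailing newline
os.chmod(path, 0o400)                 # owner read-only

print(oct(os.stat(path).st_mode & 0o777))
```

Upload it with hdfs dfs -put sqoop.pwd /user/hadoop/ and then pass --password-file /user/hadoop/sqoop.pwd in place of --password.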


The next article covers Sqoop partial imports: conditional filtering via the three options --query, --columns, and --where.