This is article 22 in the Big Data series. It demonstrates the complete workflow of importing MySQL table data into HDFS with Sqoop.
Complete illustrated version: CSDN Original | Juejin
Experimental Environment
Three 2C4G public cloud servers (h121, h122, h123) form a Hadoop cluster, with the MySQL service running on h122. HDFS, YARN, and Hive are all up and running.
Prepare Test Data
Create a test table in MySQL and batch-insert data:
CREATE DATABASE sqoop;
USE sqoop;
CREATE TABLE goodtbl (
gname VARCHAR(50),
serialNumber INT,
price INT,
stock_number INT,
create_time DATE
);
-- Use stored procedure to batch insert 100 test records
CALL batchInsertTestData(1, 100);
Full Import Command
sqoop import \
--connect jdbc:mysql://h122.wzk.icu:3306/sqoop \
--username hive \
--password hive@wzk.icu \
--table goodtbl \
--target-dir /user/hadoop/goodtbl \
--delete-target-dir \
--num-mappers 4 \
--fields-terminated-by "\t"
Key Parameter Explanations
| Parameter | Description |
|---|---|
| --connect | JDBC connection string |
| --username / --password | Database credentials |
| --table | Source table name |
| --target-dir | Target HDFS directory |
| --delete-target-dir | Delete the target directory first if it exists, avoiding conflicts |
| --num-mappers | Number of parallel Mappers (default 4) |
| --fields-terminated-by | Output field delimiter |
MapReduce Parallel Mechanism
A Sqoop import auto-generates a MapReduce job in which multiple Mappers work in parallel:
- Query the source table's primary key range (min / max)
- Divide the range evenly into N intervals (N = --num-mappers)
- Each Mapper pulls its interval's slice of data via JDBC
- Each Mapper writes its output file to the HDFS target directory
For a primary key range of 1-100 with 4 Mappers, the splits are: 1-25, 26-50, 51-75, 76-100.
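The split arithmetic above can be sketched in Python. This is a simplified model of Sqoop's integer split logic, not the actual implementation (the real splitter also handles non-integer split columns and open/closed interval boundaries):

```python
def compute_splits(min_val, max_val, num_mappers):
    """Divide the inclusive key range [min_val, max_val] into
    num_mappers roughly equal intervals, as Sqoop does for an
    integer split column (simplified model)."""
    total = max_val - min_val + 1          # number of key values
    size = total // num_mappers            # base interval width
    remainder = total % num_mappers        # leftovers go to the first intervals
    splits, lo = [], min_val
    for i in range(num_mappers):
        hi = lo + size - 1 + (1 if i < remainder else 0)
        splits.append((lo, hi))
        lo = hi + 1
    return splits

# Primary key 1-100 with 4 Mappers, as in the example above:
print(compute_splits(1, 100, 4))  # [(1, 25), (26, 50), (51, 75), (76, 100)]
```

Each tuple becomes one Mapper's range predicate on the split column.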
If the table has no primary key, or the key values are unevenly distributed, use --split-by to specify the split column:
sqoop import \
--connect jdbc:mysql://h122.wzk.icu:3306/sqoop \
--username hive \
--password hive@wzk.icu \
--table goodtbl \
--target-dir /user/hadoop/goodtbl \
--split-by serialNumber \
--num-mappers 4 \
--fields-terminated-by "\t"
Custom Query Import
When you need a multi-table JOIN or complex filtering, use --query instead of --table:
sqoop import \
--connect jdbc:mysql://h122.wzk.icu:3306/sqoop \
--username hive \
--password hive@wzk.icu \
--query "SELECT gname, price, create_time FROM goodtbl WHERE \$CONDITIONS" \
--target-dir /user/hadoop/goodtbl_filtered \
--split-by serialNumber \
--num-mappers 2 \
--fields-terminated-by "\t"
Note: When using --query, the WHERE clause must include the $CONDITIONS placeholder (escaped as \$CONDITIONS inside double quotes); Sqoop automatically replaces it with each Mapper's split condition. When running with more than one Mapper, --query also requires --split-by.
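Conceptually, the $CONDITIONS substitution works roughly like this. This is a hedged sketch of the idea only; the real Sqoop builds these predicates internally from the split column's min/max values:

```python
def mapper_queries(query_template, split_col, splits):
    """Replace the $CONDITIONS placeholder with each Mapper's
    range predicate on the split column (simplified model)."""
    return [
        query_template.replace(
            "$CONDITIONS", f"{split_col} >= {lo} AND {split_col} <= {hi}")
        for lo, hi in splits
    ]

template = "SELECT gname, price, create_time FROM goodtbl WHERE $CONDITIONS"
for q in mapper_queries(template, "serialNumber", [(1, 50), (51, 100)]):
    print(q)
```

Each Mapper then runs its own rewritten query over JDBC, so the slices never overlap.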
Verify Import Results
# View generated files on HDFS
hdfs dfs -ls /user/hadoop/goodtbl
# View data content (first 5 lines)
hdfs dfs -cat /user/hadoop/goodtbl/part-m-00000 | head -5
Typical output (tab-separated):
ProductA 1 88 100 2024-01-01
ProductB 2 66 200 2024-01-02
Common Issues
Q: Import error Target directory already exists
A: Add --delete-target-dir parameter, or execute hdfs dfs -rm -r /target/path in advance.
Q: Should I use 1 Mapper or several?
A: Use -m 1 for small tables to avoid unnecessary parallelism overhead; for large tables, increase the count based on cluster resources. Typically 4-8 Mappers work well.
Q: Isn't a plaintext password a security risk?
A: Use the --password-file parameter to point to a file on HDFS containing the password, avoiding plaintext exposure in the shell history.
The next article introduces Sqoop partial imports: three ways to filter data, via --query, --columns, and --where.