This is article 23 in the Big Data series. It introduces three practical ways to import a subset of MySQL data into HDFS with Sqoop, filtering by column, by row condition, or with a custom query.
Complete illustrated version: CSDN Original | Juejin
Why Partial Import is Needed
Full import is useful for initialization, but production often only needs:
- Specific columns (reduce transfer volume)
- Rows meeting conditions (e.g., products with price > threshold)
- Result set after multi-table JOIN
Sqoop provides three ways to achieve partial import.
Method 1: --query Custom Query
Suitable for complex SQL filtering or multi-table JOIN:
sqoop import \
--connect jdbc:mysql://h122.wzk.icu:3306/sqoop \
--username hive \
--password hive@wzk.icu \
--target-dir /root/sqoop_query \
--append \
-m 1 \
--fields-terminated-by "\t" \
--query 'select gname, serialNumber, price, stock_number, create_time
from goodtbl where price > 88 and $CONDITIONS'
Core Precautions:
- The WHERE clause must include the $CONDITIONS placeholder; Sqoop uses it to inject split conditions for each mapper
- Inside single quotes, $CONDITIONS needs no escaping; inside double quotes, write \$CONDITIONS so the shell does not expand it
- --append means append data to the existing directory instead of overwriting it
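To make the placeholder concrete, here is a minimal Python sketch of the substitution (a simplification of my own, not Sqoop's actual Java implementation; real split boundaries come from MIN/MAX of the split column):

```python
# Simplified sketch (not Sqoop's actual code): each mapper runs the same
# free-form query with $CONDITIONS replaced by its own range predicate.
query = ("select gname, serialNumber, price, stock_number, create_time "
         "from goodtbl where price > 88 and $CONDITIONS")

def split_queries(query, split_col, lo, hi, num_mappers):
    """Replace $CONDITIONS with one range predicate per mapper."""
    step = (hi - lo) // num_mappers
    result = []
    for i in range(num_mappers):
        start = lo + i * step
        # last split closes the range with <=; earlier splits use <
        end = hi if i == num_mappers - 1 else start + step
        op = "<=" if i == num_mappers - 1 else "<"
        cond = f"( {split_col} >= {start} ) AND ( {split_col} {op} {end} )"
        result.append(query.replace("$CONDITIONS", cond))
    return result

for q in split_queries(query, "id", 1, 100, 2):
    print(q)
```

With -m 1 there is only one split, so the whole range goes to a single mapper, which is why the example above still needs the placeholder even though it is not split.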
Method 2: --columns Specify Columns
Simplest way when only importing partial columns:
sqoop import \
--connect jdbc:mysql://h122.wzk.icu:3306/sqoop \
--username hive \
--password hive@wzk.icu \
--target-dir /root/sqoop_columns \
--delete-target-dir \
--num-mappers 1 \
--fields-terminated-by "\t" \
--columns gname,serialNumber,price \
--table goodtbl
Note: separate multiple column names with commas only, no spaces; otherwise parsing fails.
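The comma rule can be checked before submitting the job. This is an illustrative pre-flight helper (my own, not part of Sqoop) that mimics a naive comma split:

```python
# Hypothetical pre-flight check (not a Sqoop API): a plain comma split,
# which is why "gname, serialNumber" would yield a name with a leading space.
def parse_columns(arg):
    cols = arg.split(",")
    bad = [c for c in cols if c != c.strip() or not c]
    if bad:
        raise ValueError(f"invalid column names: {bad}")
    return cols

print(parse_columns("gname,serialNumber,price"))
# ['gname', 'serialNumber', 'price']
```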
Method 3: --where Condition Filter
Uses standard SQL WHERE syntax combined with --table; this is the most intuitive method:
sqoop import \
--connect jdbc:mysql://h122.wzk.icu:3306/sqoop \
--username hive \
--password hive@wzk.icu \
--target-dir /root/sqoop_where \
--delete-target-dir \
-m 1 \
--fields-terminated-by "\t" \
--table goodtbl \
--where "price >= 68"
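Conceptually, --table, --columns and --where just assemble a plain SELECT. The sketch below (a simplification of my own, not Sqoop's query generator) shows the statement the options map to:

```python
# Simplified sketch of the SELECT implied by --table, --columns and --where.
def build_select(table, columns=None, where=None):
    col_list = ", ".join(columns) if columns else "*"
    sql = f"SELECT {col_list} FROM {table}"
    if where:
        sql += f" WHERE ({where})"
    return sql

print(build_select("goodtbl", where="price >= 68"))
# SELECT * FROM goodtbl WHERE (price >= 68)
```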
Comparison of Three Methods
| Method | Applicable Scenario | Multi-table JOIN | Column Filter | Row Filter |
|---|---|---|---|---|
| --query | Complex SQL, multi-table JOIN | Yes | Yes | Yes |
| --columns | Only partial columns needed | No | Yes | No |
| --where | Simple row filter | No | No | Yes |

--columns and --where can be combined, but neither can be used with --query.
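The exclusivity rule can be expressed as a small validation sketch (a hypothetical checker of my own; Sqoop performs an equivalent check when parsing arguments):

```python
# Hypothetical argument check mirroring the rule above: --columns/--where may
# be combined with --table, but none of them may appear alongside --query.
def validate(opts: dict):
    if "query" in opts and {"table", "columns", "where"} & opts.keys():
        raise ValueError("--query cannot be combined with --table/--columns/--where")
    return True

print(validate({"table": "goodtbl", "columns": "gname,price", "where": "price >= 68"}))
```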
Choice Between --append and --delete-target-dir
| Parameter | Behavior | Applicable Scenario |
|---|---|---|
| --append | Append files to existing directory | Incremental scenario, don’t overwrite historical data |
| --delete-target-dir | Delete target directory first, then write | Full overwrite, ensures idempotence |
| No parameter | Error and exit if directory exists | Default protection against mistakes |
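The three behaviors can be mimicked locally. Below is an illustrative sketch using the local filesystem as a stand-in for HDFS (my own analogue; actual HDFS semantics are handled by Sqoop and Hadoop):

```python
import shutil
from pathlib import Path

# Local-filesystem analogue of the three target-directory behaviors above
# (illustrative only, not how Sqoop talks to HDFS).
def prepare_target(path, append=False, delete_target_dir=False):
    target = Path(path)
    if target.exists():
        if delete_target_dir:
            shutil.rmtree(target)   # full overwrite: clear old output first
        elif not append:
            # default protection: refuse to write into an existing directory
            raise FileExistsError(f"output directory {target} already exists")
    target.mkdir(parents=True, exist_ok=True)
    return target
```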
Verify Results
# View HDFS files
hdfs dfs -ls /root/
# Check data content
hdfs dfs -cat /root/sqoop_query/part-m-00000
After a successful import, you can view the MapReduce job's execution details in the YARN Web UI (http://h121.wzk.icu:8088) to confirm mapper task allocation and time consumption.
The next article covers Sqoop importing MySQL data directly into a Hive table, and exporting Hive data back to MySQL: bidirectional operations.