This is article 25 in the Big Data series. It walks through Sqoop incremental data import in practice, then extends into CDC (Change Data Capture) principles and modern solutions.
Necessity of Incremental Import
Full import is suitable for initialization, but production databases gain new records every day. Re-scanning the full table on every sync:
- Wastes bandwidth and compute resources
- Adds extra read pressure on the source database
- Has high latency and cannot approach real time
Sqoop provides the --incremental append mode, which imports only the data added since the last sync.
Sqoop Incremental Import Practice
Step 1: Initialize Baseline Data
-- Clear the source and Hive tables, then regenerate 100 test records
TRUNCATE TABLE sqoop.goodtbl;
TRUNCATE TABLE mydb.goodtbl;
CALL batchInsertTestData(1, 100);
Step 2: Import 50 Records to Hive (Simulate Historical Data)
sqoop import \
--connect jdbc:mysql://h122.wzk.icu:3306/sqoop \
--username hive \
--password hive@wzk.icu \
--table goodtbl \
--incremental append \
--hive-import \
--fields-terminated-by "\t" \
--hive-table mydb.goodtbl \
--check-column serialNumber \
--last-value 50 \
-m 1
Verify: since --last-value 50 skips rows with serialNumber ≤ 50, Hive should now hold 50 records:
SELECT COUNT(*) FROM mydb.goodtbl;
-- Result: 50
Step 3: Incrementally Sync New Data
With --last-value 100, this run only picks up rows whose serialNumber exceeds 100, so make sure new records have been inserted into MySQL (for example, by calling batchInsertTestData again) before running it:
sqoop import \
--connect jdbc:mysql://h122.wzk.icu:3306/sqoop \
--username hive \
--password hive@wzk.icu \
--table goodtbl \
--incremental append \
--hive-import \
--fields-terminated-by "\t" \
--hive-table mydb.goodtbl \
--check-column serialNumber \
--last-value 100 \
-m 1
Incremental Parameter Explanations
| Parameter | Description |
|---|---|
| --incremental append | Append mode, suitable for monotonically increasing columns (such as an auto-increment primary key) |
| --check-column | Column used to detect new data (usually an auto-increment ID or a timestamp) |
| --last-value | Maximum value from the last sync; this import only reads rows greater than this value |

--incremental lastmodified is suitable for scenarios with UPDATE operations; it uses a timestamp column to track changes.
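The append-mode mechanics boil down to keeping a watermark (the --last-value) and selecting only rows past it. A minimal Python simulation, using an in-memory SQLite table as a stand-in for the MySQL source (table and column names mirror the article's example; the helper function is hypothetical):

```python
import sqlite3

# Simulated source table with an auto-increment "serialNumber" check column
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE goodtbl (serialNumber INTEGER PRIMARY KEY, name TEXT)")
src.executemany("INSERT INTO goodtbl (name) VALUES (?)",
                [(f"item-{i}",) for i in range(1, 101)])

def incremental_import(conn, check_column, last_value):
    """Mimic --incremental append: fetch only rows past the saved watermark."""
    rows = conn.execute(
        f"SELECT serialNumber, name FROM goodtbl "
        f"WHERE {check_column} > ? ORDER BY {check_column}",
        (last_value,),
    ).fetchall()
    new_last = rows[-1][0] if rows else last_value  # next run's --last-value
    return rows, new_last

# Sync with last_value=50 pulls records 51..100, like the article's step 2
batch, watermark = incremental_import(src, "serialNumber", 50)
print(len(batch), watermark)  # 50 100
```

Running it again with the updated watermark (100) returns nothing until new rows appear, which is exactly why step 3 above needs fresh data first.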
CDC: Change Data Capture
Sqoop incremental import is essentially Query-Based CDC. Understanding the complete CDC system helps choose appropriate tools.
Core Value of CDC
- Compress batch ETL latency from hours to seconds
- Incremental capture replaces full scan, significantly reduces source database load
- Support zero-downtime data migration and hot standby
- Provide event-driven architecture foundation for microservices
Comparison of Four Capture Methods
| Method | Principle | Advantages | Disadvantages |
|---|---|---|---|
| Query Comparison | Poll timestamp/sequence column | Simple, non-intrusive | Polling latency, full table scan overhead |
| Triggers | INSERT/UPDATE/DELETE triggers write to a change table | Real-time capture | Intrusive to the database, trigger maintenance is complex |
| Log Parsing | Parse binlog/WAL/redo log | Almost zero additional load, recommended | Requires database config permissions |
| Snapshot Comparison | Periodic full snapshot diff | No special permissions | High storage overhead, high latency |
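The snapshot-comparison row can be illustrated with a tiny diff over two hypothetical snapshots keyed by primary key — which also makes its storage cost obvious, since both full snapshots must be held:

```python
# Two full snapshots keyed by primary key (data is made up for illustration)
old_snap = {1: ("alice", 100), 2: ("bob", 200), 3: ("carol", 300)}
new_snap = {1: ("alice", 100), 2: ("bob", 250), 4: ("dave", 400)}

# Diff the snapshots into the three kinds of change events
inserts = {k: v for k, v in new_snap.items() if k not in old_snap}
deletes = {k: v for k, v in old_snap.items() if k not in new_snap}
updates = {k: new_snap[k] for k in old_snap.keys() & new_snap.keys()
           if old_snap[k] != new_snap[k]}

print(inserts)  # {4: ('dave', 400)}
print(deletes)  # {3: ('carol', 300)}
print(updates)  # {2: ('bob', 250)}
```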
Recommended Method: Log-Based CDC
Parse MySQL binlog or PostgreSQL WAL, almost no additional source database load, latency can reach millisecond level.
Mainstream Open Source CDC Solutions
| Tool | Features | Applicable Scenarios |
|---|---|---|
| Flink CDC | Based on Flink Runtime, supports Incremental Snapshot, built-in MySQL/PG/Oracle/MongoDB connectors | Real-time data warehouse, unified batch and streaming |
| Debezium | Kafka Connect plugin, covers 10+ data sources | Kafka ecosystem, event sourcing |
| Canal | Alibaba open source, lightweight MySQL binlog parsing | Pure MySQL scenario, simple and easy to use |
| Maxwell | Lightweight MySQL CDC, outputs JSON | Simple scenarios, quick integration |
| SeaTunnel | Unified batch and streaming data integration, supports visual config | Multi-data source integration platform |
Production Design Points
Initial Snapshot vs Incremental Sync
- Small tables (< 1GB): One-time full snapshot
- TB-level large tables: Use Flink CDC Incremental Snapshot, lockless concurrent reading, doesn’t block business writes
Message Encoding
Recommend row-level JSON/Avro format, use Schema Registry to manage Schema evolution, avoid downstream parsing failures.
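A row-level change event in the Debezium-style envelope looks roughly like this (the op/before/after field layout follows Debezium's convention; the concrete values are made up):

```python
import json

# Debezium-style row-level change event: op codes are c=create, u=update, d=delete
event = {
    "op": "u",
    "ts_ms": 1700000000000,
    "source": {"db": "sqoop", "table": "goodtbl"},
    "before": {"serialNumber": 42, "price": 9.9},
    "after":  {"serialNumber": 42, "price": 12.5},
}

# Bytes on the wire (e.g. a Kafka record value), then decoded downstream
encoded = json.dumps(event).encode("utf-8")
decoded = json.loads(encoded)
assert decoded["after"]["price"] == 12.5
```

Carrying both before and after images lets downstream consumers build deletes, audits, or merges without querying the source again; a Schema Registry then guards this structure as the source table evolves.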
Exactly-Once Semantics
Kafka + Flink two-phase commit (2PC) + Checkpoint mechanism ensures end-to-end exactly-once processing.
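The core of the guarantee can be shown at a much smaller scale: commit the rows and the consumed offset in one sink transaction, so a replayed batch after a crash is detected and skipped. A simplified sketch with SQLite standing in for the sink (this is an illustration of the idea, not the actual Kafka/Flink 2PC protocol):

```python
import sqlite3

sink = sqlite3.connect(":memory:")
sink.executescript("""
CREATE TABLE goodtbl (serialNumber INTEGER PRIMARY KEY, price REAL);
CREATE TABLE offsets (part INTEGER PRIMARY KEY, next_offset INTEGER);
INSERT INTO offsets VALUES (0, 0);
""")

def commit_batch(conn, part, batch, start_offset):
    """Apply a batch and advance the offset in ONE transaction.  If the job
    crashes and redelivers from an old offset, the already-committed batch
    is skipped, giving effectively-once results."""
    (committed,) = conn.execute(
        "SELECT next_offset FROM offsets WHERE part=?", (part,)).fetchone()
    if start_offset < committed:
        return  # duplicate delivery after a restart: skip
    with conn:  # atomic: rows and offset move together, or neither does
        conn.executemany("INSERT OR REPLACE INTO goodtbl VALUES (?, ?)", batch)
        conn.execute("UPDATE offsets SET next_offset=? WHERE part=?",
                     (start_offset + len(batch), part))

commit_batch(sink, 0, [(1, 9.9), (2, 19.9)], start_offset=0)
commit_batch(sink, 0, [(1, 9.9), (2, 19.9)], start_offset=0)  # replay, ignored
print(sink.execute("SELECT COUNT(*) FROM goodtbl").fetchone()[0])  # 2
```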
Monitoring Metrics
End-to-end lag = source database's latest LSN (log position) − the offset committed by the sink
Key alerts: binlog consumption backlog, Sink write failure rate, Schema mismatch errors.
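As a toy illustration of the lag formula above (all numbers are made up):

```python
# Compare the source's latest log position with the sink's committed offset;
# alert when the gap exceeds a threshold or keeps growing between checks.
source_latest_lsn = 120_450
sink_committed_lsn = 120_200

lag = source_latest_lsn - sink_committed_lsn
print(lag)               # 250 log units still in flight
alert = lag > 10_000     # hypothetical alert threshold
```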
Common Challenges
- Write Amplification: peak write traffic can create Kafka partition hotspots; choose the partition key carefully
- Schema Drift: DDL changes on source tables can break downstream parsing; a schema compatibility strategy is needed
- Historical Data Backfill: data that existed before CDC went live needs a separate full initialization
- Clock Skew: global ordering is hard to guarantee when merging multi-source data; business-level idempotent processing is needed
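The idempotence point can be implemented as a version-gated upsert: apply an event only when its version (a source LSN or monotonically increasing event timestamp) is newer than what the sink already holds, so late out-of-order duplicates are harmless. A minimal sketch:

```python
# Business-level idempotence under clock skew: last-writer-wins by version,
# where "version" comes from the source log (LSN), not the arrival order.
state = {}  # primary key -> (version, payload)

def apply_event(key, version, payload):
    """Upsert only if the incoming event is newer than the stored row."""
    cur = state.get(key)
    if cur is None or version > cur[0]:
        state[key] = (version, payload)

apply_event("user-1", 5, {"name": "alice"})
apply_event("user-1", 3, {"name": "stale"})  # late out-of-order event: dropped
print(state["user-1"])  # (5, {'name': 'alice'})
```

Replaying the same event stream in any order converges to the same final state, which is exactly what a CDC sink needs after retries or restarts.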
With this, the Sqoop portion of the series (Big Data 21-25) comes to an end. After finishing Sqoop, it is worth exploring Flink CDC for real-time incremental sync to meet the low-latency requirements of modern data engineering.