This is article 25 in the Big Data series. It walks through Sqoop incremental data import in practice, then extends into CDC (Change Data Capture) principles and modern solutions.
Necessity of Incremental Import
Full import is suitable for initialization, but production databases gain new records every day. Re-scanning the full table on every sync:
- Wastes bandwidth and compute resources
- Adds extra read pressure on the source database
- Has high latency and cannot approach real time
Sqoop provides the --incremental append mode, which imports only the data added since the last sync.
Sqoop Incremental Import Practice
Step 1: Initialize Baseline Data
-- Clear the source and Hive tables, then regenerate 100 test records
TRUNCATE TABLE sqoop.goodtbl;
TRUNCATE TABLE mydb.goodtbl;
CALL batchInsertTestData(1, 100);
Step 2: Import 50 Records to Hive (Simulate Historical Data)
sqoop import \
--connect jdbc:mysql://h122.wzk.icu:3306/sqoop \
--username hive \
--password hive@wzk.icu \
--table goodtbl \
--incremental append \
--hive-import \
--fields-terminated-by "\t" \
--hive-table mydb.goodtbl \
--check-column serialNumber \
--last-value 50 \
-m 1
Verify: since --last-value 50 skips rows with serialNumber ≤ 50, Hive should now hold 50 records:
SELECT COUNT(*) FROM mydb.goodtbl;
-- Result: 50
Step 3: Incrementally Sync New Data
With --last-value 100, this run only picks up rows whose serialNumber exceeds 100, so make sure new records have been inserted into MySQL (for example, by calling batchInsertTestData again) before running it:
sqoop import \
--connect jdbc:mysql://h122.wzk.icu:3306/sqoop \
--username hive \
--password hive@wzk.icu \
--table goodtbl \
--incremental append \
--hive-import \
--fields-terminated-by "\t" \
--hive-table mydb.goodtbl \
--check-column serialNumber \
--last-value 100 \
-m 1
Incremental Parameter Explanations
| Parameter | Description |
|---|---|
| --incremental append | Append mode, suitable for monotonically increasing columns (such as an auto-increment primary key) |
| --check-column | Column used to detect new data (usually an auto-increment ID or a timestamp) |
| --last-value | Maximum value from the last sync; this import only reads rows greater than this value |

--incremental lastmodified is suitable for scenarios with UPDATE operations; it uses a timestamp column to track changes.
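The append-mode mechanics boil down to keeping a watermark (the --last-value) and selecting only rows past it. A minimal Python simulation, using an in-memory SQLite table as a stand-in for the MySQL source (table and column names mirror the article's example; the helper function is hypothetical):

```python
import sqlite3

# Simulated source table with an auto-increment "serialNumber" check column
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE goodtbl (serialNumber INTEGER PRIMARY KEY, name TEXT)")
src.executemany("INSERT INTO goodtbl (name) VALUES (?)",
                [(f"item-{i}",) for i in range(1, 101)])

def incremental_import(conn, check_column, last_value):
    """Mimic --incremental append: fetch only rows past the saved watermark."""
    rows = conn.execute(
        f"SELECT serialNumber, name FROM goodtbl "
        f"WHERE {check_column} > ? ORDER BY {check_column}",
        (last_value,),
    ).fetchall()
    new_last = rows[-1][0] if rows else last_value  # next run's --last-value
    return rows, new_last

# Sync with last_value=50 pulls records 51..100, like the article's step 2
batch, watermark = incremental_import(src, "serialNumber", 50)
print(len(batch), watermark)  # 50 100
```

Running it again with the updated watermark (100) returns nothing until new rows appear, which is exactly why step 3 above needs fresh data first.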
CDC: Change Data Capture
Sqoop incremental import is essentially Query-Based CDC. Understanding the complete CDC system helps choose appropriate tools.
Core Value of CDC
- Compress batch ETL latency from hours to seconds
- Incremental capture replaces full scan, significantly reduces source database load
- Support zero-downtime data migration and hot standby
- Provide event-driven architecture foundation for microservices
Comparison of Four Capture Methods
| Method | Principle | Advantages | Disadvantages |
|---|---|---|---|
| Query Comparison | Poll timestamp/sequence column | Simple, non-intrusive | Polling latency, full table scan overhead |
| Triggers | INSERT/UPDATE/DELETE triggers write to a change table | Real-time capture | Intrusive to the database, trigger maintenance is complex |
| Log Parsing | Parse binlog/WAL/redo log | Almost zero additional load, recommended | Requires database config permissions |
| Snapshot Comparison | Periodic full snapshot diff | No special permissions | High storage overhead, high latency |
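The snapshot-comparison row can be illustrated with a tiny diff over two hypothetical snapshots keyed by primary key — which also makes its storage cost obvious, since both full snapshots must be held:

```python
# Two full snapshots keyed by primary key (data is made up for illustration)
old_snap = {1: ("alice", 100), 2: ("bob", 200), 3: ("carol", 300)}
new_snap = {1: ("alice", 100), 2: ("bob", 250), 4: ("dave", 400)}

# Diff the snapshots into the three kinds of change events
inserts = {k: v for k, v in new_snap.items() if k not in old_snap}
deletes = {k: v for k, v in old_snap.items() if k not in new_snap}
updates = {k: new_snap[k] for k in old_snap.keys() & new_snap.keys()
           if old_snap[k] != new_snap[k]}

print(inserts)  # {4: ('dave', 400)}
print(deletes)  # {3: ('carol', 300)}
print(updates)  # {2: ('bob', 250)}
```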
Recommended Method: Log-Based CDC
Parse MySQL binlog or PostgreSQL WAL, almost no additional source database load, latency can reach millisecond level.
Mainstream Open Source CDC Solutions
| Tool | Features | Applicable Scenarios |
|---|---|---|
| Flink CDC | Based on Flink Runtime, supports Incremental Snapshot, built-in MySQL/PG/Oracle/MongoDB connectors | Real-time data warehouse, unified batch and streaming |
| Debezium | Kafka Connect plugin, covers 10+ data sources | Kafka ecosystem, event sourcing |
| Canal | Alibaba open source, lightweight MySQL binlog parsing | Pure MySQL scenario, simple and easy to use |
| Maxwell | Lightweight MySQL CDC, outputs JSON | Simple scenarios, quick integration |
| SeaTunnel | Unified batch and streaming data integration, supports visual config | Multi-data source integration platform |
Production Design Points
Initial Snapshot vs Incremental Sync
- Small tables (< 1GB): One-time full snapshot
- TB-level large tables: Use Flink CDC Incremental Snapshot, lockless concurrent reading, doesn’t block business writes
Message Encoding
Recommend row-level JSON/Avro format, use Schema Registry to manage Schema evolution, avoid downstream parsing failures.
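A row-level change event in the Debezium-style envelope looks roughly like this (the op/before/after field layout follows Debezium's convention; the concrete values are made up):

```python
import json

# Debezium-style row-level change event: op codes are c=create, u=update, d=delete
event = {
    "op": "u",
    "ts_ms": 1700000000000,
    "source": {"db": "sqoop", "table": "goodtbl"},
    "before": {"serialNumber": 42, "price": 9.9},
    "after":  {"serialNumber": 42, "price": 12.5},
}

# Bytes on the wire (e.g. a Kafka record value), then decoded downstream
encoded = json.dumps(event).encode("utf-8")
decoded = json.loads(encoded)
assert decoded["after"]["price"] == 12.5
```

Carrying both before and after images lets downstream consumers build deletes, audits, or merges without querying the source again; a Schema Registry then guards this structure as the source table evolves.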
Exactly-Once Semantics
Kafka + Flink two-phase commit (2PC) + Checkpoint mechanism ensures end-to-end exactly-once processing.
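The core of the guarantee can be shown at a much smaller scale: commit the rows and the consumed offset in one sink transaction, so a replayed batch after a crash is detected and skipped. A simplified sketch with SQLite standing in for the sink (this is an illustration of the idea, not the actual Kafka/Flink 2PC protocol):

```python
import sqlite3

sink = sqlite3.connect(":memory:")
sink.executescript("""
CREATE TABLE goodtbl (serialNumber INTEGER PRIMARY KEY, price REAL);
CREATE TABLE offsets (part INTEGER PRIMARY KEY, next_offset INTEGER);
INSERT INTO offsets VALUES (0, 0);
""")

def commit_batch(conn, part, batch, start_offset):
    """Apply a batch and advance the offset in ONE transaction.  If the job
    crashes and redelivers from an old offset, the already-committed batch
    is skipped, giving effectively-once results."""
    (committed,) = conn.execute(
        "SELECT next_offset FROM offsets WHERE part=?", (part,)).fetchone()
    if start_offset < committed:
        return  # duplicate delivery after a restart: skip
    with conn:  # atomic: rows and offset move together, or neither does
        conn.executemany("INSERT OR REPLACE INTO goodtbl VALUES (?, ?)", batch)
        conn.execute("UPDATE offsets SET next_offset=? WHERE part=?",
                     (start_offset + len(batch), part))

commit_batch(sink, 0, [(1, 9.9), (2, 19.9)], start_offset=0)
commit_batch(sink, 0, [(1, 9.9), (2, 19.9)], start_offset=0)  # replay, ignored
print(sink.execute("SELECT COUNT(*) FROM goodtbl").fetchone()[0])  # 2
```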
Monitoring Metrics
End-to-end lag = source database's latest LSN (log position) − the offset committed by the sink
Key alerts: binlog consumption backlog, Sink write failure rate, Schema mismatch errors.
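As a toy illustration of the lag formula above (all numbers are made up):

```python
# Compare the source's latest log position with the sink's committed offset;
# alert when the gap exceeds a threshold or keeps growing between checks.
source_latest_lsn = 120_450
sink_committed_lsn = 120_200

lag = source_latest_lsn - sink_committed_lsn
print(lag)               # 250 log units still in flight
alert = lag > 10_000     # hypothetical alert threshold
```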
Common Challenges
- Write Amplification: peak write traffic can create Kafka partition hotspots; choose the partition key carefully
- Schema Drift: DDL changes on source tables can break downstream parsing; a schema compatibility strategy is needed
- Historical Data Backfill: data that existed before CDC went live needs a separate full initialization
- Clock Skew: global ordering is hard to guarantee when merging multi-source data; business-level idempotent processing is needed
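The idempotence point can be implemented as a version-gated upsert: apply an event only when its version (a source LSN or monotonically increasing event timestamp) is newer than what the sink already holds, so late out-of-order duplicates are harmless. A minimal sketch:

```python
# Business-level idempotence under clock skew: last-writer-wins by version,
# where "version" comes from the source log (LSN), not the arrival order.
state = {}  # primary key -> (version, payload)

def apply_event(key, version, payload):
    """Upsert only if the incoming event is newer than the stored row."""
    cur = state.get(key)
    if cur is None or version > cur[0]:
        state[key] = (version, payload)

apply_event("user-1", 5, {"name": "alice"})
apply_event("user-1", 3, {"name": "stale"})  # late out-of-order event: dropped
print(state["user-1"])  # (5, {'name': 'alice'})
```

Replaying the same event stream in any order converges to the same final state, which is exactly what a CDC sink needs after retries or restarts.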
With this, the Sqoop portion of the series (Big Data 21-25) comes to an end. After finishing Sqoop, it is worth exploring Flink CDC for real-time incremental sync to meet the low-latency requirements of modern data engineering.