This is article 25 in the Big Data series. It covers hands-on Sqoop incremental data import, then extends into CDC (Change Data Capture) principles and modern solutions.

Complete illustrated version: CSDN Original | Juejin

Necessity of Incremental Import

Full import works for initialization, but production databases gain new records every day. Re-scanning the entire table on every sync would:

  • Waste bandwidth and compute resources
  • Add extra read pressure on the source database
  • Incur high latency, far from near-real-time

Sqoop's --incremental append mode imports only the rows added since the last sync.

Sqoop Incremental Import Practice

Step 1: Initialize Baseline Data

-- Empty both tables, then insert 100 fresh records
TRUNCATE TABLE sqoop.goodtbl;      -- MySQL source table
TRUNCATE TABLE mydb.goodtbl;       -- Hive target table
CALL batchInsertTestData(1, 100);

Step 2: Import Records with serialNumber > 50 to Hive (Simulate Historical Data)

sqoop import \
  --connect jdbc:mysql://h122.wzk.icu:3306/sqoop \
  --username hive \
  --password hive@wzk.icu \
  --table goodtbl \
  --incremental append \
  --hive-import \
  --fields-terminated-by "\t" \
  --hive-table mydb.goodtbl \
  --check-column serialNumber \
  --last-value 50 \
  -m 1

Verify that Hive now holds 50 records:

SELECT COUNT(*) FROM mydb.goodtbl;
-- Result: 50

Step 3: Incrementally Sync Remaining Data

With --last-value 100, this run imports only rows whose serialNumber exceeds 100, i.e. records added to the MySQL table after the previous sync.

sqoop import \
  --connect jdbc:mysql://h122.wzk.icu:3306/sqoop \
  --username hive \
  --password hive@wzk.icu \
  --table goodtbl \
  --incremental append \
  --hive-import \
  --fields-terminated-by "\t" \
  --hive-table mydb.goodtbl \
  --check-column serialNumber \
  --last-value 100 \
  -m 1

Incremental Parameter Explanations

| Parameter | Description |
| --- | --- |
| `--incremental append` | Incremental mode, suitable for append-only columns (e.g. an auto-increment primary key) |
| `--check-column` | Column used to detect new data (usually an auto-increment ID or a timestamp) |
| `--last-value` | Maximum value from the last sync; this run imports only rows greater than it |

--incremental lastmodified suits tables that also receive UPDATEs; it tracks changes via a timestamp column.
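The bookkeeping behind --check-column and --last-value can be sketched in a few lines of Python. This is an illustration of the query-based approach, not Sqoop's actual internals; the table and column names are taken from the example above:

```python
def incremental_query(table, check_column, last_value):
    """Build a query of the shape Sqoop issues for --incremental append."""
    return f"SELECT * FROM {table} WHERE {check_column} > {last_value}"

def sync_increment(rows, check_column, last_value):
    """Simulate one incremental run: select new rows and advance last_value."""
    new_rows = [r for r in rows if r[check_column] > last_value]
    next_value = max((r[check_column] for r in new_rows), default=last_value)
    return new_rows, next_value

# Simulated source table with an auto-increment key 1..100.
rows = [{"serialNumber": i} for i in range(1, 101)]

batch, last = sync_increment(rows, "serialNumber", 50)
print(len(batch), last)   # 50 100 -- rows 51..100, new last-value 100
```

Running `sync_increment` again with the returned value yields an empty batch, which is exactly why each Sqoop run must persist the new last-value for the next invocation.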

CDC: Change Data Capture

Sqoop's incremental import is essentially query-based CDC. Understanding the full CDC landscape helps in choosing the right tool.

Core Value of CDC

  • Compress batch ETL latency from hours to seconds
  • Incremental capture replaces full scan, significantly reduces source database load
  • Support zero-downtime data migration and hot backup
  • Provide event-driven architecture foundation for microservices

Comparison of Four Capture Methods

| Method | Principle | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Query comparison | Poll a timestamp/sequence column | Simple, non-intrusive | Polling latency, full-table scan overhead |
| Triggers | Database triggers write each change to a shadow table | Real-time capture | Intrusive to the database, extra write overhead, hard to maintain |
| Log parsing | Parse the binlog/WAL/redo log | Almost no extra load; the recommended approach | Requires database configuration and permissions |
| Snapshot comparison | Diff periodic full snapshots | No special permissions needed | High storage overhead, high latency |

Recommended Method: Log-Based CDC

Parse MySQL binlog or PostgreSQL WAL, almost no additional source database load, latency can reach millisecond level.
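Downstream of a log parser, a consumer replays the ordered change events to rebuild table state. The sketch below uses a hypothetical simplified event format (real binlog/WAL events carry positions, timestamps, and before/after images); it shows why upserts make replay safe:

```python
# Apply a stream of log-based change events to an in-memory "table".
# Event format here is hypothetical: {"op": ..., "key": ..., "row": ...}
def apply_event(table, event):
    op, key = event["op"], event["key"]
    if op in ("insert", "update"):
        table[key] = event["row"]   # upsert: replaying the same event is harmless
    elif op == "delete":
        table.pop(key, None)        # deleting a missing key is also harmless
    return table

events = [
    {"op": "insert", "key": 1, "row": {"name": "a"}},
    {"op": "update", "key": 1, "row": {"name": "b"}},
    {"op": "insert", "key": 2, "row": {"name": "c"}},
    {"op": "delete", "key": 2},
]

table = {}
for e in events:
    apply_event(table, e)
print(table)  # {1: {'name': 'b'}}
```

Because events are applied in log order, the final state matches the source table regardless of how many inserts, updates, and deletes occurred in between.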

Mainstream Open Source CDC Solutions

| Tool | Features | Applicable Scenarios |
| --- | --- | --- |
| Flink CDC | Built on the Flink runtime; supports incremental snapshot; MySQL/PG/Oracle/MongoDB connectors built in | Real-time data warehouses, unified batch and streaming |
| Debezium | Kafka Connect plugin covering 10+ data sources | Kafka ecosystem, event sourcing |
| Canal | Alibaba open source, lightweight MySQL binlog parsing | MySQL-only scenarios, simple and easy to use |
| Maxwell | Lightweight MySQL CDC, outputs JSON | Simple scenarios, quick integration |
| SeaTunnel | Unified batch/streaming data integration, supports visual configuration | Multi-source integration platforms |

Production Design Points

Initial Snapshot vs Incremental Sync

  • Small tables (< 1GB): One-time full snapshot
  • TB-level large tables: Use Flink CDC Incremental Snapshot, lockless concurrent reading, doesn’t block business writes
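The core idea of an incremental snapshot is to split the table into primary-key ranges that can be read concurrently without locks. A minimal sketch of the chunking step (chunk sizing and boundary handling here are illustrative, not Flink CDC's actual algorithm):

```python
def split_chunks(min_pk, max_pk, chunk_size):
    """Split the key range [min_pk, max_pk] into inclusive chunks
    that parallel snapshot readers can scan independently."""
    chunks = []
    lo = min_pk
    while lo <= max_pk:
        hi = min(lo + chunk_size - 1, max_pk)
        chunks.append((lo, hi))
        lo = hi + 1
    return chunks

print(split_chunks(1, 100, 30))  # [(1, 30), (31, 60), (61, 90), (91, 100)]
```

Each chunk becomes an independent `WHERE pk BETWEEN lo AND hi` scan; changes that arrive during the snapshot are reconciled afterwards from the log, which is what makes the read lockless.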

Message Encoding

Prefer a row-level JSON/Avro format, and use a Schema Registry to manage schema evolution so that downstream consumers do not break on parsing.
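The envelope idea can be sketched with plain JSON: every message carries a schema version, and consumers refuse versions they do not understand instead of silently mis-parsing. The field names here are illustrative, not a registry's real wire format:

```python
import json

SCHEMA_VERSION = 2  # bumped whenever the row layout changes

def encode_event(op, row):
    """Wrap a row-level change in a self-describing JSON envelope."""
    return json.dumps({"schema_version": SCHEMA_VERSION, "op": op, "row": row})

def decode_event(payload, supported=(1, 2)):
    """Reject messages whose schema version the consumer cannot handle."""
    event = json.loads(payload)
    if event["schema_version"] not in supported:
        raise ValueError(f"unsupported schema version {event['schema_version']}")
    return event

msg = encode_event("insert", {"serialNumber": 101, "name": "x"})
print(decode_event(msg)["row"]["serialNumber"])  # 101
```

A real Schema Registry adds compatibility checks (backward/forward) at publish time, so an incompatible producer change is rejected before it ever reaches consumers.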

Exactly-Once Semantics

Kafka transactions plus Flink's two-phase commit (2PC) sinks and checkpoint mechanism together ensure end-to-end exactly-once processing.
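The essence of the guarantee is that data and the consumed offset are committed atomically, so a replay after failure neither loses nor duplicates records. A toy model of such a transactional sink (this is a conceptual sketch, not Flink's `TwoPhaseCommitSinkFunction`):

```python
class TransactionalSink:
    """Commit a batch and its source offset in one atomic step."""

    def __init__(self):
        self.rows = {}
        self.committed_offset = -1

    def commit(self, offset, batch):
        if offset <= self.committed_offset:
            return False              # batch already applied: replay is a no-op
        self.rows.update(batch)       # data and offset advance together
        self.committed_offset = offset
        return True

sink = TransactionalSink()
sink.commit(0, {1: "a"})
sink.commit(1, {2: "b"})
sink.commit(1, {2: "b"})                      # replayed batch is ignored
print(len(sink.rows), sink.committed_offset)  # 2 1
```

Recovery simply restarts from the last committed offset; because commits below that offset are rejected, redelivered batches have no effect.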

Monitoring Metrics

End-to-end lag = latest source-DB log position (LSN/binlog offset) − position committed by the sink

Key alerts: binlog consumption backlog, Sink write failure rate, Schema mismatch errors.

Common Challenges

  1. Write Amplification and Hotspots: during write peaks, change events can pile up on a few Kafka partitions; choose a partition key that spreads the load
  2. Schema Drift: Source table DDL changes may cause downstream parsing failures, need Schema compatibility strategy
  3. Historical Data Backfill: History data before CDC launch needs separate full initialization
  4. Clock Skew: Global order difficult to guarantee when merging multi-source data, need business-level idempotent processing
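Point 4's business-level idempotency usually means deduplicating by a globally unique event id, so that retries and redeliveries from any source are applied at most once. A minimal sketch (the event shape and id scheme are hypothetical):

```python
def apply_idempotent(state, seen, event):
    """Apply an event at most once, keyed by a globally unique event id."""
    if event["id"] in seen:
        return False                 # duplicate from a retry or replay
    seen.add(event["id"])
    state[event["key"]] = event["value"]
    return True

state, seen = {}, set()
events = [
    {"id": "src1-1", "key": "k", "value": 1},
    {"id": "src2-1", "key": "k", "value": 2},
    {"id": "src1-1", "key": "k", "value": 1},  # duplicate delivery
]
applied = sum(apply_idempotent(state, seen, e) for e in events)
print(applied, state["k"])  # 2 2
```

In production the `seen` set lives in the sink's storage (e.g. a unique-key constraint or a TTL-bounded dedup table) rather than in memory, but the contract is the same: duplicates are detected by id, not by arrival time.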

This concludes the Sqoop series (Big Data 21–25). After finishing Sqoop, consider exploring Flink CDC for real-time incremental sync to meet the low-latency demands of modern data engineering.