Big Data 262 - Canal Sync Intro
Canal Sync Business Data
Environment Preparation
- Hadoop
- HBase
- Flink
- ClickHouse
- MySQL
- Canal
- Kafka
Canal Introduction
Canal grew out of Alibaba's B2B business: sellers were concentrated in China while buyers were mostly overseas, which created a need for data synchronization between the Hangzhou and US data centers. Starting in 2010, Alibaba began parsing database logs to obtain incremental changes for synchronization, and this gave rise to its incremental subscription and consumption business.
Canal is Java-based middleware that parses database incremental logs and provides incremental data subscription and consumption. Currently, Canal primarily supports MySQL binlog parsing; after parsing, a Canal client processes the captured data. (Full database-to-database synchronization is handled by the otter middleware, which is built on top of Canal.)
Canal’s Background and Original Purpose
In large-scale internet applications, especially on e-commerce platforms like Alibaba's, database scale and complexity are typically very high. To improve data timeliness and availability, many business scenarios require real-time synchronization of database changes (orders, user data, etc.) while keeping data consistent across multiple systems.
The birth of Canal was precisely to solve this problem. Through Change Data Capture (CDC) technology, Canal can efficiently monitor and capture database data changes (such as insert, update, delete operations), then synchronize these changes in real-time to downstream systems such as search engines, data warehouses, caches, and more.
Canal’s Core Functions
Change Data Capture (CDC)
Canal's core function is implementing change data capture. It obtains changed data by parsing the database's log (the binlog, in MySQL's case). Canal primarily targets MySQL, and can reach other databases such as PostgreSQL and Oracle through extensions, capturing changes in real time.
- Incremental data synchronization: By parsing binlog, Canal can identify incremental database changes (insert, update, delete, etc.) and synchronize these changes to downstream systems.
- Efficient processing: Canal efficiently captures and synchronizes data changes, supports large-scale distributed deployment, and can handle high-concurrency, high-throughput scenarios.
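As an illustration (this is not Canal's actual API, just a minimal model), a captured row-level change can be represented as a small record carrying the operation type, the table, and the before/after row images:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ChangeEvent:
    """One row-level change parsed from the binlog (illustrative model)."""
    op: str                  # "insert" | "update" | "delete"
    schema: str              # database name
    table: str               # table name
    before: Optional[dict]   # row image before the change (None for insert)
    after: Optional[dict]    # row image after the change (None for delete)

# Example: an order row being updated from "created" to "paid"
ev = ChangeEvent(
    op="update",
    schema="shop",
    table="orders",
    before={"id": 1, "status": "created"},
    after={"id": 1, "status": "paid"},
)
print(ev.op, ev.table, ev.after["status"])  # update orders paid
```

Carrying both row images is what lets downstream consumers handle updates and deletes, not just inserts.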
Data Synchronization
Canal can not only capture data changes but also synchronize these data changes to other systems. Common application scenarios include:
- Synchronization to search engines: for example, synchronizing MySQL data changes to Elasticsearch in real time so search results stay fresh.
- Synchronization to data warehouses: Achieve real-time data synchronization in OLAP (Online Analytical Processing) systems, such as synchronizing database changes to Hadoop, ClickHouse, and other big data platforms.
- Cache synchronization: After database changes, Canal can update caches (such as Redis) in real-time to ensure cache data is the latest.
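Keeping a cache consistent amounts to replaying change events against it. Here is a minimal sketch with an in-memory dict standing in for a cache such as Redis (the event shape and key scheme are illustrative assumptions):

```python
# In-memory stand-in for a cache such as Redis (illustrative only).
cache = {}

def apply_to_cache(event: dict) -> None:
    """Replay one binlog-derived change event against the cache."""
    key = f"{event['table']}:{event['row']['id']}"
    if event["op"] == "delete":
        cache.pop(key, None)       # drop the now-deleted entry
    else:                          # insert or update
        cache[key] = event["row"]  # overwrite with the latest row image

apply_to_cache({"op": "insert", "table": "orders", "row": {"id": 1, "status": "created"}})
apply_to_cache({"op": "update", "table": "orders", "row": {"id": 1, "status": "paid"}})
print(cache["orders:1"]["status"])  # paid
apply_to_cache({"op": "delete", "table": "orders", "row": {"id": 1}})
print(cache)  # {}
```

Because every change flows through the same handler, the cache never serves a value older than the last replayed binlog event.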
Heterogeneous Database Support
Beyond traditional relational databases such as MySQL, Canal can be extended through plugins to work with databases like PostgreSQL and Oracle, adapting to synchronization requirements across heterogeneous systems.
Data Parsing and Transformation
Canal allows parsing and transforming captured data. Users can customize processing of captured data based on APIs provided by Canal, such as data format conversion, filtering, etc.
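A transformation step of this kind typically filters events down to the tables of interest and reshapes rows for the downstream consumer. A hedged sketch (the event shape and output format are assumptions, not Canal's API):

```python
def transform(events, keep_tables):
    """Filter events to the tables of interest and reshape each row
    into a flat dict for a downstream consumer (illustrative format)."""
    for ev in events:
        if ev["table"] not in keep_tables:
            continue                       # drop tables we don't sync
        yield {
            "key": f"{ev['table']}:{ev['row']['id']}",
            "op": ev["op"].upper(),
            "payload": ev["row"],
        }

events = [
    {"op": "insert", "table": "orders", "row": {"id": 1, "amount": 99}},
    {"op": "insert", "table": "audit_log", "row": {"id": 7}},  # filtered out
]
out = list(transform(events, keep_tables={"orders"}))
print(out[0]["op"], out[0]["key"])  # INSERT orders:1
```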
Canal’s Working Principle
Canal’s working principle can be divided into the following steps:
- Connect to database: Canal connects to the source database and obtains its binlog (for MySQL, the binary log; other databases have similar mechanisms). The binlog records all data change operations in the database.
- Parse binlog: Canal continuously monitors and parses the database’s binlog files. Whenever there is new data change, Canal reads the new binlog and parses out the specific change content (insert, update, delete, etc.).
- Data processing: Canal processes the captured changed data through data parsing, then synchronizes the data to downstream systems (such as message queues, caches, search engines, data warehouses, etc.) according to configuration and requirements.
- Persistence and monitoring: Canal also provides persistence mechanisms to save the synchronization status of already synchronized data, along with monitoring capabilities to promptly detect system abnormalities and trigger alert processing.
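The steps above can be sketched as a single loop: resume from a persisted offset, parse each new entry, dispatch it to downstream sinks, then persist the new offset. This is a simplified illustration, not Canal's implementation; real binlog entries are binary, and here they are mocked as JSON strings:

```python
import json

def run_pipeline(binlog, sinks, state):
    """Illustrative sync loop: skip already-processed entries, parse,
    dispatch to downstream sinks, then persist the new offset."""
    offset = state.get("offset", 0)       # resume point from last run
    for pos, raw in enumerate(binlog):
        if pos < offset:
            continue                      # already processed earlier
        event = json.loads(raw)           # stands in for binlog parsing
        for sink in sinks:
            sink(event)                   # dispatch downstream
        state["offset"] = pos + 1         # persist progress after dispatch

# Usage: two mocked binlog entries, one sink that just collects events
seen = []
state = {"offset": 0}
binlog = ['{"op": "insert", "id": 1}', '{"op": "delete", "id": 1}']
run_pipeline(binlog, [seen.append], state)
print(state["offset"], len(seen))  # 2 2
```

Persisting the offset only after dispatch means a crash replays at-least-once rather than losing events, which is the usual trade-off in such pipelines.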
Canal’s Advantages
- Real-time: Canal can capture database changes very quickly, suitable for scenarios requiring real-time data synchronization.
- High throughput: It can handle large volumes of data changes, suitable for high-concurrency, high-throughput scenarios.
- Flexibility: Canal provides rich extension interfaces and plugin support, allowing users to customize processing according to their needs.
- High availability: Canal supports distributed deployment and high-availability architecture, enabling load balancing and failover between multiple nodes to ensure system stability.
Canal Deployment and Usage
Deploying and using Canal is not complicated. Users only need to configure Canal's connection to the source database and define the target systems, and Canal can begin capturing and synchronizing data in real time. It supports both standalone and cluster modes, and users can choose a deployment method based on their actual throughput and availability requirements.
Configuration
Users need to configure the following:
- Database connection information: Including database IP, port, username, password, etc.
- Binlog configuration: specify the binlog file and position to start from, and which schemas and tables to capture.
- Target system configuration: Configure which system to synchronize changed data to, such as Kafka, Elasticsearch, Redis, etc.
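As a sketch, a minimal MySQL source configuration might look like the following; the key names follow the convention of Canal's instance.properties, the values are placeholders, and the example configuration shipped with Canal is the authoritative reference:

```properties
# Source MySQL address (host:port)
canal.instance.master.address=127.0.0.1:3306
# Credentials of a MySQL user with replication privileges
canal.instance.dbUsername=canal
canal.instance.dbPassword=canal
# Regex filter for which schemas/tables to capture (here: everything)
canal.instance.filter.regex=.*\\..*
```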
Startup and Monitoring
Canal provides simple command-line tools to start the service, and the running status of Canal can be monitored in real-time through a web interface or monitoring systems (such as Prometheus).
Usage Scenarios
Original Scenario
Canal originated as part of Alibaba's otter middleware.
- Data synchronization: Achieve data synchronization between different databases to ensure data consistency.
- Real-time data warehouse: Synchronize source database changes to the data warehouse in real-time for data analysis and report display.
- Real-time search engine update: synchronize database changes to search engines in real time to improve search freshness and accuracy.
- Cache update: when database changes occur, promptly update cache data to avoid stale or inconsistent caches.
Slowly Changing Dimension (SCD)
Consider an order table: on June 20 it has 3 records, by June 21 it has 5 records, and by June 22 it has 6 records. To preserve this table's history, design a slowly changing dimension table in the data warehouse:
- dw_begin_date represents the start of the record's lifecycle, and dw_end_date represents the end of the record's lifecycle
- dw_end_date = '9999-12-31' indicates the record is currently valid
- To query currently valid records, use where dw_end_date = '9999-12-31'
- To query the historical snapshot as of 2012-06-21, use where dw_begin_date <= '2012-06-21' and dw_end_date >= '2012-06-21'
This exactly reproduces the records as they stood in the original table on June 21. As this shows, a slowly changing dimension table can satisfy historical-data needs while greatly saving storage space.
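The maintenance rule for such a table can be sketched as a simplified type-2 SCD update (a Python illustration; the column names follow the text above, while the row structure and helper name are assumptions):

```python
OPEN_END = "9999-12-31"  # dw_end_date value marking a currently valid record

def scd2_upsert(dim_rows, key, new_attrs, change_date):
    """Close the currently valid record for `key` if its attributes changed,
    then append the new version valid from `change_date`."""
    for row in dim_rows:
        if row["key"] == key and row["dw_end_date"] == OPEN_END:
            if row["attrs"] == new_attrs:
                return                        # nothing changed; keep as-is
            row["dw_end_date"] = change_date  # end the old version's lifecycle
    dim_rows.append({
        "key": key,
        "attrs": new_attrs,
        "dw_begin_date": change_date,
        "dw_end_date": OPEN_END,
    })

dim = []
scd2_upsert(dim, 1, {"status": "created"}, "2012-06-20")
scd2_upsert(dim, 1, {"status": "paid"}, "2012-06-22")

# Snapshot as of 2012-06-21: dw_begin_date <= d <= dw_end_date
d = "2012-06-21"
snap = [r for r in dim if r["dw_begin_date"] <= d <= r["dw_end_date"]]
print(snap[0]["attrs"])  # {'status': 'created'}
```

ISO-formatted date strings compare correctly as plain strings, so the snapshot filter is exactly the where clause given above.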
Real-time Statistics
Capture new change data from business tables for real-time statistics.
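For instance, a running metric can be maintained directly from the change stream; a minimal sketch, assuming events shaped like the earlier illustrations (table, op, and row images are illustrative, not Canal's wire format):

```python
from collections import Counter

order_count = Counter()  # running count of orders per status (illustrative metric)

def on_change(event):
    """Update real-time statistics from one captured change event."""
    if event["table"] != "orders":
        return
    if event["op"] == "insert":
        order_count[event["row"]["status"]] += 1
    elif event["op"] == "update":
        order_count[event["before"]["status"]] -= 1  # move count between buckets
        order_count[event["row"]["status"]] += 1

on_change({"op": "insert", "table": "orders", "row": {"status": "created"}})
on_change({"op": "update", "table": "orders",
           "before": {"status": "created"}, "row": {"status": "paid"}})
print(order_count["paid"], order_count["created"])  # 1 0
```

Because the statistic is driven by changes rather than periodic full scans, it stays current without re-reading the business table.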