Big Data 263 - Canal Workflow

Working Principle

Canal is an open-source tool for MySQL database binlog incremental subscription and consumption. It is primarily used to solve data synchronization and distributed transaction problems, supporting the synchronization of database changes to other systems such as message queues and big data platforms.

The replication process is divided into three steps:

  • The Master writes change records to the binary log
  • The Slave sends a DUMP protocol to the MySQL Master, copying binary log events to its relay log
  • The Slave reads and replays relay log events to sync changes to its own database

Core Concepts

Binlog (Binary Log)

MySQL’s binary log records all events of data changes in the database, such as INSERT, UPDATE, DELETE operations. Canal captures these data changes by parsing the binlog.

Master-Slave Replication Mechanism

MySQL supports master-slave replication. The master transmits the binlog to the slave, and the slave replays the operations based on the binlog to achieve data synchronization. Canal simulates the behavior of a MySQL slave and receives binlog data by masquerading as a slave.

Dump and Parse

  • Dump phase: Obtains binlog data stream by simulating the MySQL Slave protocol.
  • Parse phase: Parses the binlog data and converts it into a consumable data structure.

Event and Entry Model

Canal converts binlog data into an event stream (Event), containing specific SQL operations and the affected table and data content.

Workflow

Masquerade as Slave and Connect to Master

Canal simulates a MySQL slave, executes the COM_REGISTER_SLAVE command to register with the master, and requests to pull binlog from a certain position. The connection process is similar to the MySQL master-slave replication protocol.

Pull Binlog Data

Canal sends the BINLOG_DUMP command to obtain the binlog stream from the master.

Parse Binlog Data

After receiving the binary log, Canal parses it according to the MySQL binlog format, restoring it to SQL operations or data change events.

Format Event Data

Canal converts the parsed binlog data into internal Event objects. Events contain table name, primary key, operation type (insert, update, delete), and specific data content.

Data Delivery and Consumption

Canal provides the parsed Events to consumers, such as message queues like Kafka and RocketMQ, or directly writes to storage systems like HDFS and Elasticsearch. Consumers can perform real-time processing or analysis as needed.

Detailed Data Flow Process

Simulate Slave to Sync Binlog

  • Canal simulates a Slave and sends COM_BINLOG_DUMP request to the master.
  • After receiving the request, the master starts sending binlog data to Canal.
  • Canal records the binlog position information for checkpoint resume.

Binlog Data Parsing

Canal parses various event types according to MySQL’s binlog format:

  • Query Event: SQL statement events, such as DDL operations (CREATE TABLE, etc.)
  • Row Event: Row-level operation events, specifically including:
    • Write Rows (INSERT)
    • Update Rows (UPDATE)
    • Delete Rows (DELETE)
  • Canal converts these events into a common Entry format, encapsulating specific row data and metadata.

Delivery and Consumption Processing

  • The event data provided by Canal will be pushed to message queues (Kafka, RabbitMQ) or storage systems (HDFS, Elasticsearch).
  • Downstream applications obtain data from these queues or storage for real-time analysis, synchronization, indexing, and other operations.

Binlog Message Structure

Canal converts binlog into a unified message format, mainly including the following parts:

  • Header: Describes message metadata, such as database name, table name, operation type, etc.
  • Row Data: Records table row change information, including primary key and column data (old values, new values)
  • Transaction Info: Supports transaction begin and end flags to ensure data consistency.

MySQL Binlog

Binlog Introduction

MySQL’s binary log can be considered the most important log in MySQL. It records all DDL and DML (except data query statements) in event form, also including the execution time consumed by statements. MySQL’s binary log is transaction-safe.

Generally, enabling the binary log incurs about 1% performance loss. There are two most important scenarios for binary logs:

  • MySQL Replication: Enable Binlog on the Master side, and the Master transmits its binary log to Slaves to achieve Master-Slave data consistency.
  • Data recovery: Recover data using the MySQLBinlog tool.

Binary files include two types of files: binary log index file (file suffix is .index) to record all binary files, binary log file (filename suffix .0000*) to record all DDL and DML (except query statements) statement events in the database.

Enable Binlog

In the MySQL configuration file, add under the [mysqld] section:

log-bin=mysql-bin

Binlog Format Settings

MySQL binlog has three formats: STATEMENT, MIXED, and ROW.

statement

Statement-level, binlog records every write operation statement executed. Compared to row mode, it saves space, but may produce inconsistencies. For example: update tt set create_date=now(), when using binlog to recover, the result will be different because the execution time points are different. Advantages: Saves space Disadvantages: May cause data inconsistency

row

Row-level, binlog records changes to each row after every operation. Advantages: Maintains absolute consistency, because regardless of what the SQL is and what functions it references, it only records the effect after execution. Disadvantages: Occupies more space

mixed

Statement upgrade version, to some extent solves the inconsistency problem caused by certain situations in statement mode. For example, in certain situations:

  • When the function contains UUID
  • When a table with AUTO_INCREMENT field is updated
  • When executing INSET DELAYED statements
  • When using UDF, it processes according to row mode

Advantages: Saves space while ensuring some consistency Disadvantages: There are still very rare cases that can cause inconsistency, and both statement and mixed are inconvenient for binlog monitoring requirements.