MySQL MHA Master-Slave Failover: Architecture and Impleme...

MHA Architecture Deep Dive

MHA Basic Concepts

MHA (Master High Availability) is a mature MySQL high availability solution developed by Yoshinori Matsunobu at the Japanese company DeNA. It automates failover and master switching in a MySQL master-slave replication setup.

Core Features

Fast Failover:
- Automatically completes master failover within 30 seconds
- Typical failover time is between 10-30 seconds
- Supports manually triggered master switch (planned switch), usually taking only 0.5-2 seconds
Data Consistency Guarantee:
- Ensures data consistency by comparing relay log differences between slaves
- Supports automatically fetching binlog from old master for data repair
Flexible Deployment:
- Supports standard master-slave replication architecture
- Compatible with GTID-based replication mode

Typical Application Scenarios

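Financial Trading Systems: Payment systems that must minimize data loss (under asynchronous replication MHA cannot guarantee zero loss; it is usually paired with semi-synchronous replication for that)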
E-commerce Platforms: Order processing systems during promotions, product inventory management systems
Social Networking Services: User relationship storage systems, instant messaging storage systems

Working Principles

Monitoring Phase: MHA Manager checks master availability through periodic pings
Fault Detection: When master is unreachable, initiates secondary confirmation mechanism
Failover Process: Identifies the slave with data closest to master, applies differential relay log to ensure data consistency

Components

1. MHA Manager (Management Node)

MHA Manager is the control center of the entire architecture, responsible for monitoring and managing the entire MySQL master-slave replication cluster.

Flexible deployment: Can be independently deployed on a dedicated management server or on a Slave node
Main responsibilities:
- Continuously monitors Master node health status
- Automatically triggers and controls failover when Master fails
- Checks MySQL replication status and lag
Typical workflow:
1. Checks Master reachability every 3 seconds (configurable)
2. Determines Master failure after 3 consecutive failed checks
3. Automatically selects the Slave with the latest data as new Master
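
The detection rule in the workflow above (a check every 3 seconds, failure declared after 3 consecutive misses) can be sketched as follows. This is a minimal illustration, not MHA's own code; the function name and the list of ping results are assumptions made for the example.

```python
from typing import Iterable

PING_INTERVAL_SECONDS = 3      # check interval named in the text (configurable)
MAX_CONSECUTIVE_FAILURES = 3   # misses in a row before the Master is declared dead


def master_is_dead(ping_results: Iterable[bool],
                   max_failures: int = MAX_CONSECUTIVE_FAILURES) -> bool:
    """Return True once `max_failures` pings in a row have failed.

    `ping_results` is a hypothetical stream of check outcomes
    (True = Master answered, False = no answer).
    """
    consecutive = 0
    for ok in ping_results:
        # Any successful ping resets the counter.
        consecutive = 0 if ok else consecutive + 1
        if consecutive >= max_failures:
            return True
    return False
```

Requiring consecutive failures means a single dropped ping does not trigger a failover; only an uninterrupted run of misses does.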

2. MHA Node (Data Node)

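MHA Node is a set of helper scripts installed on every MySQL server and invoked by the Manager over SSH; it is not a long-running daemon.

Core functions:
- On Master node: Saves and transfers the binary log (binlog) events that the Slaves have not yet received (save_binary_logs)
- On Slave node: Generates and applies differential relay logs (apply_diff_relay_logs)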
- Precisely identifies log difference points between Slaves
- Applies missing log events to lagging Slaves
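
To make "log difference points" concrete, the sketch below computes, for each lagging Slave, the range of master binlog coordinates it is missing relative to the most advanced Slave. It is an illustration under assumptions: the `Coord` type and function names are invented here, binlog files are assumed to end in a numeric suffix such as `mysql-bin.000007`, and the input is assumed to be non-empty.

```python
from typing import Dict, NamedTuple, Tuple


class Coord(NamedTuple):
    file_seq: int  # numeric suffix of the master binlog file (7 for mysql-bin.000007)
    pos: int       # byte offset inside that file


def parse_coord(log_file: str, pos: int) -> Coord:
    """Turn ('mysql-bin.000007', 1200) into a comparable Coord(7, 1200)."""
    return Coord(int(log_file.rsplit(".", 1)[1]), pos)


def diff_points(received: Dict[str, Coord]) -> Dict[str, Tuple[Coord, Coord]]:
    """Map each lagging slave to the (start, end) coordinates it is missing.

    `received` holds, per slave, how far it has read the master's binlog.
    Slaves that are already at the latest coordinate are omitted.
    """
    latest = max(received.values())  # tuples compare by file first, then offset
    return {name: (coord, latest)
            for name, coord in received.items()
            if coord < latest}
```

For example, with one Slave at `mysql-bin.000007:1200` and another at `mysql-bin.000007:800`, the second is reported as missing the events between offsets 800 and 1200.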

Fault Handling

MHA Fault Handling Mechanism Deep Dive

Save binary logs from crashed master
- After automatically detecting Master crash, immediately saves complete binlog from Master server
- Connects to original Master server via SSH and transfers unsynchronized binlog to management node
Locate the latest slave
- Compares all Slave’s SHOW SLAVE STATUS information
- Determines which Slave has data closest to Master based on Master_Log_File and Read_Master_Log_Pos, which record how much of the Master's binlog each Slave has received
Repair other slaves
- Gets relay log information from the latest Slave
- Uses mysqlbinlog tool to parse and replay relay log to other lagging Slaves
Master-slave switch operation
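- Executes STOP SLAVE on the latest Slave once it has applied all outstanding logs, then clears its replication settings with RESET SLAVE
- Requires log-bin and the other master-required parameters to be enabled in my.cnf on the candidate beforehand, since a new Master cannot feed Slaves without a binary log
- Promotes it to new Master by turning off read_only; the remaining Slaves are then repointed to it with the CHANGE MASTER TO command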

Rebuild replication topology

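-- Run on each remaining Slave; 'new_master_ip' is a placeholder
STOP SLAVE;
CHANGE MASTER TO
  MASTER_HOST='new_master_ip',
  MASTER_AUTO_POSITION=1;  -- GTID mode; otherwise set MASTER_LOG_FILE and MASTER_LOG_POS
START SLAVE;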
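
The "locate the latest slave" step can be sketched as a comparison over parsed SHOW SLAVE STATUS rows. This is a simplified illustration, not MHA's implementation: it ranks Slaves by Master_Log_File and Read_Master_Log_Pos (how far each I/O thread has read the Master's binlog), and the host names and helper functions are hypothetical.

```python
from typing import Dict, Tuple


def binlog_seq(log_file: str) -> int:
    """Extract the numeric suffix, e.g. 7 from 'mysql-bin.000007'."""
    return int(log_file.rsplit(".", 1)[1])


def pick_latest_slave(statuses: Dict[str, Dict[str, str]]) -> str:
    """Return the host that has received the most of the master's binlog.

    `statuses` maps host name -> fields of that host's SHOW SLAVE STATUS.
    On a tie, the first host in iteration order wins.
    """
    def received(host: str) -> Tuple[int, int]:
        st = statuses[host]
        return (binlog_seq(st["Master_Log_File"]),
                int(st["Read_Master_Log_Pos"]))

    return max(statuses, key=received)
```

Comparing the file sequence before the offset matters: a Slave at offset 120 of file 000008 is ahead of one at offset 9000 of file 000007.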

Primary-Standby Switch

Primary-standby switch refers to the process of converting a standby database to primary and the original primary to standby.

Primary-Standby Switch Strategies

Reliability-First Strategy: Ensures data consistency as highest priority
- Typical implementation steps: a) Stop writes to primary b) Wait for standby to catch up with primary (Seconds_Behind_Master=0) c) Promote standby to new primary d) Enable writes on new primary
- Applicable scenarios: Financial trading, order systems and other businesses with strict data consistency requirements
Availability-First Strategy: Ensures service availability as highest priority
- Typical implementation steps: a) Directly promote standby to new primary b) Allow new primary to immediately accept writes c) Original primary later rejoins as a standby and catches up on the data it missed
- Applicable scenarios: Social networks, content platforms and other businesses prioritizing service continuity

Primary-Standby Lag

Primary-standby lag measures how far the data applied on the standby trails the data committed on the primary.

Key time points:

T1: Primary A completes transaction execution and writes to binlog
T2: Standby B completely receives that binlog
T3: Standby B executes the transaction in that binlog

Primary-standby lag calculation formula: Lag time = T3 - T1
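
A small worked example of the formula, using made-up timestamps for T1, T2 and T3. The split into transfer time (T2 - T1) and apply wait (T3 - T2) is added here only to show where the lag accumulates.

```python
from datetime import datetime


def replication_lag_seconds(t1: datetime, t3: datetime) -> float:
    """Lag = T3 - T1, in seconds."""
    return (t3 - t1).total_seconds()


# Illustrative timestamps, not measurements.
t1 = datetime(2024, 1, 1, 12, 0, 0)   # primary A writes the transaction to binlog
t2 = datetime(2024, 1, 1, 12, 0, 1)   # standby B has fully received that binlog
t3 = datetime(2024, 1, 1, 12, 0, 5)   # standby B finishes executing the transaction

lag = replication_lag_seconds(t1, t3)       # 5.0
transfer = (t2 - t1).total_seconds()        # 1.0, network transfer
apply_wait = (t3 - t2).total_seconds()      # 4.0, waiting for and running the apply
```

In this example most of the 5 seconds is spent between T2 and T3, that is, on the standby applying the relay log rather than on the network.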

Important fields:

Seconds_Behind_Master: Represents current standby lag in seconds
Relay_Log_Pos: Position in the standby's current relay log file up to which the SQL thread has executed
Master_Log_File: Master binlog file being read by standby
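
These fields are usually read from the vertical output of SHOW SLAVE STATUS\G. The sketch below parses such output into a dictionary; the sample text is a shortened, invented excerpt, and both function names are assumptions for the example.

```python
from typing import Dict, Optional


def parse_slave_status(text: str) -> Dict[str, str]:
    """Parse 'Key: value' lines of SHOW SLAVE STATUS\\G output into a dict."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skips blank lines and the '*** 1. row ***' separator
        key, _, value = line.partition(":")  # split on the first colon only
        fields[key.strip()] = value.strip()
    return fields


def lag_seconds(fields: Dict[str, str]) -> Optional[int]:
    """Seconds_Behind_Master as an int, or None when it is reported as NULL."""
    raw = fields.get("Seconds_Behind_Master", "NULL")
    return None if raw == "NULL" else int(raw)


# Invented, abbreviated sample output for illustration.
SAMPLE = """
*************************** 1. row ***************************
        Master_Log_File: mysql-bin.000007
          Relay_Log_Pos: 4521
  Seconds_Behind_Master: 12
"""

status = parse_slave_status(SAMPLE)
```

Seconds_Behind_Master is NULL when replication is not running, so treating NULL as "no lag" would be a mistake; the helper returns None in that case.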

Lag Causes

Standby machine performance issues
- Insufficient hardware configuration
- Resource overload: One machine simultaneously serves as standby for multiple masters
- Network bottleneck
Workload division issues
- Read operation load: Standby bears many read requests from applications
- Background task interference
Large transaction operations
- Deleting a large amount of data in a single transaction
- Large table structure changes (such as adding indexes to tables with tens of millions of rows)
- Batch data import

Reliability-First vs Availability-First

Reliability-First Process:

Check the seconds_behind_master value of Slave B; continue only when it is below a small threshold (for example, 5 seconds), otherwise keep retrying
Change Primary A to read-only state (read_only=ON)
Wait for Slave B’s seconds_behind_master value to drop to 0
Change Slave B to writable state (read_only=OFF)
Switch business requests to Slave B
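
The reliability-first steps can be sketched as a small simulation. Everything here is hypothetical scaffolding: `Server` stands in for a MySQL instance, `read_lag` stands in for polling seconds_behind_master on Slave B, and the `max_initial_lag` threshold of 5 seconds is an assumed value that prevents starting a switch while the standby is far behind.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Server:
    name: str
    read_only: bool


def reliability_first_switch(primary: Server, standby: Server,
                             read_lag: Callable[[], int],
                             max_initial_lag: int = 5) -> List[str]:
    """Run the five steps in order and return a log of the actions taken."""
    log: List[str] = []
    lag = read_lag()                              # step 1: check standby lag
    if lag > max_initial_lag:
        raise RuntimeError(f"lag {lag}s exceeds {max_initial_lag}s; retry later")
    primary.read_only = True                      # step 2: stop writes on primary
    log.append(f"{primary.name} read_only=ON")
    while lag > 0:                                # step 3: wait for lag to reach 0
        lag = read_lag()
    standby.read_only = False                     # step 4: open standby for writes
    log.append(f"{standby.name} read_only=OFF")
    log.append(f"route traffic to {standby.name}")  # step 5: switch requests
    return log
```

Between step 2 and step 4 neither server accepts writes. That unavailable window lasts as long as the catch-up in step 3, which is why the lag is checked before the switch begins.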

Availability-First:

Don’t wait for master-slave sync to complete, directly switch business requests to Slave B
Almost no downtime, but may cause data inconsistency
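
The following toy simulation shows one way an availability-first switch can leave the two servers with different data. It assumes statement-based replication, an auto-increment id column, and that A and B replicate from each other; the table contents and the `insert` helper are invented for the example.

```python
from typing import List, Tuple

Row = Tuple[int, int]  # (id, c)


def insert(table: List[Row], c: int) -> None:
    """Statement-style insert: the id is assigned locally, like AUTO_INCREMENT."""
    next_id = table[-1][0] + 1 if table else 1
    table.append((next_id, c))


a: List[Row] = [(1, 1), (2, 2), (3, 3)]   # primary A
b: List[Row] = list(a)                    # standby B, in sync so far

insert(a, 4)   # client writes c=4 on A; B has not applied it yet (lag)
# Availability-first switch happens here: clients now write to B.
insert(b, 5)   # client writes c=5 on B, which assigns id 4
insert(b, 4)   # B finally applies A's delayed statement, which gets id 5
insert(a, 5)   # A applies B's statement, which gets id 5

# a ends with (4, 4), (5, 5); b ends with (4, 5), (5, 4)
```

Both servers hold the same values of c, but attached to different ids, and nothing reports an error. Waiting for the lag to reach 0 before opening B for writes avoids this interleaving.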

In most cases, prefer the reliability-first strategy. With data reliability ensured, MySQL’s availability depends on the replication lag: the smaller the lag, the faster service can be restored after a primary failure, and the higher the availability.

Summary

Core advantages of MHA as MySQL high availability solution:

Fast automatic failover: Completes automatically within 10-30 seconds
Data consistency guarantee: Automatically identifies and applies latest binlog events
Broad compatibility: Supports multiple replication modes
Centralized monitoring management: Can manage multiple MySQL clusters

Limitations:

Requires at least one slave to work
Brief write interruption during failover
Requires additional Manager node for monitoring management

Currently MHA mainly supports one-master-multi-slave architecture. To build MHA, a replication cluster must have a minimum of three database servers.

Update Loss Problem

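Under asynchronous replication, an update committed on the old master may not have reached any slave when the master fails. That update is lost at failover, and later writes on the new master can overwrite it.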

MMM Architecture

Overview

MMM (Multi-Master Replication Manager for MySQL) is an open-source MySQL high availability solution.

Core Functions

Monitor MySQL master-slave replication status
Automatically detect node failures and perform failover
Manage virtual IP addresses for transparent client switching

Node Roles

Writer node: Handles all write operations
Reader node: Handles read requests

Fault Handling Process

VIP removal
Role switching
Topology reconfiguration
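
The three steps can be pictured with a toy in-memory model. This is not MMM's code or configuration format; the `Cluster` class, its fields, the host names and the VIP address are all hypothetical and only illustrate the order of operations.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Cluster:
    writer: str                           # host currently holding the writer role
    writer_vip: str                       # virtual IP that clients write to
    vip_owner: Dict[str, Optional[str]]   # VIP -> host that currently holds it
    upstream: Dict[str, Optional[str]]    # host -> host it replicates from


def fail_over_writer(c: Cluster, new_writer: str) -> None:
    """Apply VIP removal, role switching, and topology reconfiguration."""
    failed = c.writer
    c.vip_owner[c.writer_vip] = None          # 1. remove the VIP from the failed node
    c.writer = new_writer                     # 2. switch the writer role...
    c.vip_owner[c.writer_vip] = new_writer    #    ...and move the VIP with it
    c.upstream[new_writer] = None             # new writer no longer follows anyone
    for host, up in c.upstream.items():       # 3. repoint replicas of the failed node
        if up == failed:
            c.upstream[host] = new_writer
    c.upstream.pop(failed, None)              # failed node leaves the topology
```

Because clients connect to the VIP rather than to a host address, moving the VIP in step 2 is what makes the switch transparent to them.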

Monitoring Mechanism

Monitor Server

Collects node health status indicators in real time
Determines node faults through heartbeat detection

Agent

Runs on each MySQL server
Monitors local MySQL service status
Executes management instructions from Monitor