MHA Architecture Deep Dive
MHA Basic Concepts
MHA (Master High Availability) is a mature MySQL high availability solution developed by Yoshinori Matsunobu from Japan’s DeNA company. It is primarily used to implement automated failover and master-slave switching in MySQL master-slave architecture.
Core Features
- Fast Failover:
  - Automatically completes master failover within 30 seconds
  - Typical failover time is between 10-30 seconds
  - Supports manually triggered master switch (planned switch), usually taking only 0.5-2 seconds
- Data Consistency Guarantee:
  - Ensures data consistency by comparing relay log differences between slaves
  - Supports automatically fetching binlog from old master for data repair
- Flexible Deployment:
  - Supports standard master-slave replication architecture
  - Compatible with GTID-based replication mode
Typical Application Scenarios
- Financial Trading Systems: Payment systems requiring zero data loss
- E-commerce Platforms: Order processing systems during promotions, product inventory management systems
- Social Networking Services: User relationship storage systems, instant messaging storage systems
Working Principles
- Monitoring Phase: MHA Manager checks master availability through periodic pings
- Fault Detection: When master is unreachable, initiates secondary confirmation mechanism
- Failover Process: Identifies the slave with data closest to master, applies differential relay log to ensure data consistency
Components
1. MHA Manager (Management Node)
MHA Manager is the control center of the entire architecture, responsible for monitoring and managing the entire MySQL master-slave replication cluster.
- Flexible deployment: Can be independently deployed on a dedicated management server or on a Slave node
- Main responsibilities:
  - Continuously monitors Master node health status
  - Automatically triggers and controls failover when Master fails
  - Checks MySQL replication status and lag in real time
- Typical workflow:
  - Checks Master reachability every 3 seconds (configurable)
  - Determines Master failure after 3 consecutive failed checks
  - Automatically selects the Slave with the latest data as new Master
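The detection rule in the typical workflow (declare failure only after several consecutive failed checks) can be sketched as a small counter. This is a minimal illustration, not MHA's actual implementation: the names `MasterMonitor` and `first_failure_index` are hypothetical, and the threshold of 3 is taken from the workflow described above.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class MasterMonitor:
    """Counts consecutive failed health checks against the master (hypothetical helper)."""

    failure_threshold: int = 3  # failed checks in a row before declaring failure
    consecutive_failures: int = 0

    def record(self, reachable: bool) -> bool:
        """Record one check result; return True once the master is considered failed."""
        if reachable:
            # Any successful check resets the streak.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.failure_threshold


def first_failure_index(results: Iterable[bool], threshold: int = 3) -> Optional[int]:
    """Return the index of the check at which failure is declared, or None."""
    monitor = MasterMonitor(failure_threshold=threshold)
    for index, reachable in enumerate(results):
        if monitor.record(reachable):
            return index
    return None
```

In a real monitoring loop, each check would be separated by the configured interval (3 seconds in the workflow above). Requiring a streak rather than a single failure avoids triggering failover on one dropped packet.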
2. MHA Node (Data Node)
MHA Node is an agent program running on each MySQL instance.
- Core functions:
  - On Master node: Saves and transfers binary logs (binlog) in real time
  - On Slave node: Receives and applies relay logs
  - Precisely identifies log difference points between Slaves
  - Applies missing log events to lagging Slaves
Fault Handling
MHA Fault Handling Mechanism Deep Dive
- Save binary logs from crashed master
  - After automatically detecting Master crash, immediately saves complete binlog from Master server
  - Connects to original Master server via SSH and transfers unsynchronized binlog to management node
- Locate the latest slave
  - Compares the SHOW SLAVE STATUS information of all Slaves
  - Determines which Slave has data closest to Master based on Exec_Master_Log_Pos and Relay_Log_Pos
- Repair other slaves
  - Gets relay log information from the latest Slave
  - Uses the mysqlbinlog tool to parse and replay relay log to other lagging Slaves
- Master-slave switch operation
  - Executes STOP SLAVE; RESET MASTER; on the latest Slave
  - Modifies my.cnf configuration, enables log-bin and other master-required parameters
  - Promotes to new Master through CHANGE MASTER TO command
- Rebuild replication topology
  STOP SLAVE; CHANGE MASTER TO MASTER_HOST='new_master_ip'; START SLAVE;
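The "locate the latest slave" and "rebuild replication topology" steps can be sketched in a few lines. This is a simplified model, not MHA's own code. It assumes each Slave's SHOW SLAVE STATUS row is already available as a dictionary, and it compares the pair (Relay_Master_Log_File, Exec_Master_Log_Pos). Adding the file name is an assumption of this sketch, so that positions in different binlog files compare correctly. The helper names and host names are hypothetical.

```python
import re
from typing import Dict, List, Tuple

Status = Dict[str, object]


def exec_position(status: Status) -> Tuple[int, int]:
    """Return (binlog file sequence number, executed position) for one slave."""
    file_name = str(status["Relay_Master_Log_File"])
    match = re.search(r"\.(\d+)$", file_name)
    if match is None:
        raise ValueError(f"unexpected binlog file name: {file_name!r}")
    return int(match.group(1)), int(status["Exec_Master_Log_Pos"])


def pick_latest_slave(statuses: Dict[str, Status]) -> str:
    """Return the host whose executed master position is furthest ahead."""
    return max(statuses, key=lambda host: exec_position(statuses[host]))


def lagging_slaves(statuses: Dict[str, Status]) -> List[str]:
    """Return the hosts that are behind the latest slave and need repair."""
    latest = exec_position(statuses[pick_latest_slave(statuses)])
    return [h for h, s in statuses.items() if exec_position(s) < latest]


def repoint_statements(new_master_host: str) -> List[str]:
    """Statements each remaining slave would run to follow the new master."""
    # Illustrative only: a real CHANGE MASTER TO also needs credentials
    # and log coordinates (or GTID auto-positioning).
    return [
        "STOP SLAVE;",
        f"CHANGE MASTER TO MASTER_HOST='{new_master_host}';",
        "START SLAVE;",
    ]


# Hypothetical status rows for three slaves.
statuses = {
    "slave1": {"Relay_Master_Log_File": "mysql-bin.000012", "Exec_Master_Log_Pos": 4500},
    "slave2": {"Relay_Master_Log_File": "mysql-bin.000012", "Exec_Master_Log_Pos": 9100},
    "slave3": {"Relay_Master_Log_File": "mysql-bin.000011", "Exec_Master_Log_Pos": 99000},
}
new_master = pick_latest_slave(statuses)
to_repair = lagging_slaves(statuses)
```

Comparing the file sequence number before the position matters: `slave3` has the largest raw position but is still in an older binlog file, so it is the furthest behind.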
Primary-Standby Switch
Primary-standby switch refers to the process of converting a standby database to primary and the original primary to standby.
Primary-Standby Switch Strategies
- Reliability-First Strategy: Ensures data consistency as highest priority
  - Typical implementation steps:
    a) Stop writes to primary
    b) Wait for standby to catch up with primary (Seconds_Behind_Master=0)
    c) Promote standby to new primary
    d) Enable writes on new primary
  - Applicable scenarios: Financial trading, order systems and other businesses with strict data consistency requirements
- Availability-First Strategy: Ensures service availability as highest priority
  - Typical implementation steps:
    a) Directly promote standby to new primary
    b) Allow new primary to immediately accept writes
    c) Original primary catches up with data
  - Applicable scenarios: Social networks, content platforms and other businesses prioritizing service continuity
Primary-Standby Lag
Primary-standby lag is the delay between a transaction completing on the primary and the same transaction being executed on the standby.
Key time points:
- T1: Primary A completes transaction execution and writes to binlog
- T2: Standby B completely receives that binlog
- T3: Standby B executes the transaction in that binlog
Primary-standby lag calculation formula: Lag time = T3 - T1
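As a small worked example of the formula, the sketch below computes the lag from the three time points and splits it into a transfer part (T2 - T1) and an apply part (T3 - T2). The function name and the sample timestamps are made up for illustration.

```python
from datetime import datetime, timedelta
from typing import Dict


def lag_breakdown(t1: datetime, t2: datetime, t3: datetime) -> Dict[str, float]:
    """Split primary-standby lag (T3 - T1) into transfer and apply time, in seconds."""
    return {
        "transfer": (t2 - t1).total_seconds(),  # binlog shipped to the standby
        "apply": (t3 - t2).total_seconds(),     # standby executes the transaction
        "total": (t3 - t1).total_seconds(),     # lag as defined above
    }


# Hypothetical timestamps: 1 second to receive the binlog, 4 more to apply it.
t1 = datetime(2024, 1, 1, 12, 0, 0)
t2 = t1 + timedelta(seconds=1)
t3 = t1 + timedelta(seconds=5)
breakdown = lag_breakdown(t1, t2, t3)
```

Separating the two parts helps when diagnosing lag: a large transfer share points at the network, while a large apply share points at the standby itself.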
Important fields:
- Seconds_Behind_Master: Represents current standby lag in seconds
- Relay_Log_Pos: Current position in the relay log being executed by standby
- Master_Log_File: Master binlog file being read by standby
Lag Causes
- Standby machine performance issues
  - Insufficient hardware configuration
  - Resource overload: One machine simultaneously serves as standby for multiple masters
  - Network bottleneck
- Division of labor issues
  - Read operation load: Standby bears many read requests from applications
  - Background task interference
- Large transaction operations
  - Single deletion of large amounts of data
  - Large table structure changes (such as adding indexes to tables with tens of millions of rows)
  - Batch data import
Reliability-First vs Availability-First
Reliability-First Process:
- Check the seconds_behind_master value of Slave B
- Change Primary A to read-only state (read_only=true)
- Wait for Slave B’s seconds_behind_master value to drop to 0
- Change Slave B to writable state (read_only=false)
- Switch business requests to Slave B
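The reliability-first process can be sketched as a simulation over two in-memory nodes. This is a toy model under stated assumptions: `Node` and `reliability_first_switch` are hypothetical names, the lag threshold used for the initial check is an assumed value (the process above only says to check it), and the loop that decrements the lag stands in for polling the standby until it has caught up.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    name: str
    read_only: bool
    seconds_behind_master: int = 0


def reliability_first_switch(primary: Node, standby: Node, max_lag: int = 5) -> List[str]:
    """Run the reliability-first steps on simulated nodes and return a step log."""
    log: List[str] = []
    # Step 1: only start when the standby is close enough (threshold is an assumption).
    if standby.seconds_behind_master > max_lag:
        log.append("abort: standby lag too high")
        return log
    # Step 2: stop writes on the primary; the system is unwritable from here on.
    primary.read_only = True
    log.append(f"{primary.name} set read-only")
    # Step 3: wait until the standby has applied everything.
    while standby.seconds_behind_master > 0:
        standby.seconds_behind_master -= 1  # simulated catch-up, one second per poll
    log.append(f"{standby.name} caught up")
    # Step 4: open the standby for writes.
    standby.read_only = False
    log.append(f"{standby.name} set writable")
    # Step 5: move client traffic.
    log.append(f"route traffic to {standby.name}")
    return log
```

Both nodes are read-only between steps 2 and 4, which is why the initial lag check matters: the smaller the lag at the start, the shorter the unwritable window.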
Availability-First:
- Don’t wait for master-slave sync to complete, directly switch business requests to Slave B
- Almost no downtime, but may cause data inconsistency
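One way the inconsistency can arise is sketched below with a toy table that assigns auto-increment ids. The scenario is an assumption made for illustration: A and B replicate from each other, replication replays statements rather than row images, and one insert on A is still in flight when traffic moves to B. The `Table` class and the sample values are hypothetical.

```python
from typing import Dict, Tuple


class Table:
    """A toy table with an auto-increment id, mapping id -> value of column c."""

    def __init__(self, rows: Dict[int, int]) -> None:
        self.rows = dict(rows)

    def insert(self, c: int) -> int:
        """Insert a row and return the id that this node assigned to it."""
        new_id = max(self.rows, default=0) + 1
        self.rows[new_id] = c
        return new_id


def availability_first_demo() -> Tuple[Dict[int, int], Dict[int, int]]:
    """Replay an availability-first switch and return the final rows on A and B."""
    initial = {1: 1, 2: 2, 3: 3}
    a, b = Table(initial), Table(initial)
    a.insert(4)  # committed on A, not yet applied on B
    # Traffic is switched to B here without waiting for B to catch up.
    b.insert(5)  # a new client write reaches B first
    b.insert(4)  # B then replays A's in-flight statement
    a.insert(5)  # A replays B's statement
    return a.rows, b.rows


rows_a, rows_b = availability_first_demo()
```

Both nodes end up with five rows, but ids 4 and 5 hold different values on each side, and neither node reports an error.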
In most cases, prefer the reliability-first strategy. Once data reliability is ensured, MySQL’s availability depends on the size of the synchronization delay: the smaller the delay, the higher the availability.
Summary
Core advantages of MHA as MySQL high availability solution:
- Fast automatic failover: Completes automatically within 10-30 seconds
- Data consistency guarantee: Automatically identifies and applies latest binlog events
- Broad compatibility: Supports multiple replication modes
- Centralized monitoring management: Can manage multiple MySQL clusters
Limitations:
- Requires at least one slave to work
- Brief write interruption during failover
- Requires additional Manager node for monitoring management
Currently MHA mainly supports one-master-multi-slave architecture. To build MHA, a replication cluster must have a minimum of three database servers.
Update Loss Problem
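- With asynchronous replication, updates committed on the old master but not yet replicated at failover time can be lost or overwritten by later writes on the new master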
MMM Architecture
Overview
MMM (Master-Master Replication Manager for MySQL) is an open-source MySQL high availability solution.
Core Functions
- Monitor MySQL master-slave replication status
- Automatically detect node failures and perform failover
- Manage virtual IP addresses for transparent client switching
Node Roles
- Writer node: Handles all write operations
- Reader node: Handles read requests
Fault Handling Process
- VIP removal
- Role switching
- Topology reconfiguration
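The three steps can be sketched as state changes on a small in-memory model of the cluster. This does not use MMM's actual tools or configuration. `MmmCluster`, its fields, and the host names are hypothetical, and the sketch only tracks which host holds the writer VIP and which role each host has.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class MmmCluster:
    roles: Dict[str, str]             # host -> "writer", "reader" or "failed"
    writer_vip_holder: Optional[str]  # host currently holding the writer VIP

    def fail_over_writer(self, standby: str) -> List[str]:
        """Move the writer role to `standby`; return readers that must be repointed."""
        failed = self.writer_vip_holder
        if failed is None:
            raise ValueError("no host currently holds the writer VIP")
        if self.roles.get(standby) != "reader":
            raise ValueError(f"{standby!r} is not an available reader")
        # 1. VIP removal: clients can no longer reach the failed writer.
        self.writer_vip_holder = None
        # 2. Role switching: the standby becomes the writer.
        self.roles[failed] = "failed"
        self.roles[standby] = "writer"
        # 3. Topology reconfiguration: the VIP moves and the remaining readers follow.
        self.writer_vip_holder = standby
        return [host for host, role in self.roles.items() if role == "reader"]


cluster = MmmCluster(
    roles={"db1": "writer", "db2": "reader", "db3": "reader"},
    writer_vip_holder="db1",
)
to_repoint = cluster.fail_over_writer("db2")
```

Removing the VIP before switching roles keeps clients from writing to the failed node while the new writer is being prepared.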
Monitoring Mechanism
Monitor Server
- Real-time collection of node health status indicators
- Determines node faults through heartbeat detection
Agent
- Runs on each MySQL server
- Monitors local MySQL service status
- Executes management instructions from Monitor