MHA Architecture Deep Dive

MHA Basic Concepts

MHA (Master High Availability) is a mature MySQL high availability solution developed by Yoshinori Matsunobu from Japan’s DeNA company. It is primarily used to implement automated failover and master-slave switching in MySQL master-slave architecture.

Core Features

  1. Fast Failover:

    • Automatically completes master failover, typically within 10-30 seconds
    • Supports manually triggered master switch (planned switch), usually with only 0.5-2 seconds of write downtime
  2. Data Consistency Guarantee:

    • Ensures data consistency by comparing relay log differences between slaves
    • Supports automatically fetching binlog from the old master for data repair, provided the old master host is still reachable via SSH
  3. Flexible Deployment:

    • Supports standard master-slave replication architecture
    • Compatible with GTID-based replication mode

Typical Application Scenarios

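  1. Financial Trading Systems: Payment systems that cannot tolerate data loss (typically combined with semi-synchronous replication, since asynchronous replication alone cannot guarantee zero loss)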
  2. E-commerce Platforms: Order processing systems during promotions, product inventory management systems
  3. Social Networking Services: User relationship storage systems, instant messaging storage systems

Working Principles

  1. Monitoring Phase: MHA Manager checks master availability through periodic pings
  2. Fault Detection: When master is unreachable, initiates secondary confirmation mechanism
  3. Failover Process: Identifies the slave with data closest to master, applies differential relay log to ensure data consistency

Components

1. MHA Manager (Management Node)

MHA Manager is the control center of the entire architecture, responsible for monitoring and managing the entire MySQL master-slave replication cluster.

  • Flexible deployment: Can be independently deployed on a dedicated management server or on a Slave node

  • Main responsibilities:

    • Continuously monitors Master node health status
    • Automatically triggers and controls failover when Master fails
    • Checks MySQL replication status and lag in real time
  • Typical workflow:

    1. Checks Master reachability every 3 seconds (configurable)
    2. Determines Master failure after 3 consecutive failed checks
    3. Automatically selects the Slave with the latest data as new Master
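
The detection rule in the workflow above can be sketched as a counter of consecutive failed checks. This is a minimal illustration and not MHA's actual code: the function name and the boolean ping history are hypothetical, and `max_failures=3` simply mirrors the threshold stated above.

```python
from typing import Iterable, Optional


def ping_at_which_master_is_declared_dead(
    ping_results: Iterable[bool], max_failures: int = 3
) -> Optional[int]:
    """Return the 0-based index of the check that triggers failover, or None.

    ping_results is a hypothetical history of reachability checks
    (True = master answered, False = check failed).
    """
    consecutive_failures = 0
    for index, reachable in enumerate(ping_results):
        if reachable:
            # Any successful check resets the counter.
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= max_failures:
                return index
    return None


# Two isolated failures do not trigger failover; three in a row do.
history = [True, False, False, True, False, False, False]
verdict_index = ping_at_which_master_is_declared_dead(history)
```

Resetting the counter on every successful check is what keeps a short network blip from causing an unnecessary failover.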

2. MHA Node (Data Node)

MHA Node is a set of tools installed on each MySQL server, which MHA Manager invokes over SSH when needed.

  • Core functions:
    • On Master node: Saves and transfers binary logs (binlog) that the Slaves have not yet received
    • On Slave node: Applies differential relay log events and purges relay logs that are no longer needed
    • Precisely identifies log difference points between Slaves
    • Applies missing log events to lagging Slaves

Fault Handling

MHA Fault Handling Mechanism Deep Dive

  1. Save binary logs from crashed master

    • After detecting a Master crash, saves the binlog events that the Slaves have not yet received, provided the Master host is still reachable
    • Connects to the original Master server via SSH and transfers the unsynchronized binlog to the management node
  2. Locate the latest slave

    • Compares the SHOW SLAVE STATUS output of all Slaves
    • Determines which Slave has received the most binlog events from the Master based on Master_Log_File and Read_Master_Log_Pos
  3. Repair other slaves

    • Gets relay log information from the latest Slave
    • Uses mysqlbinlog tool to parse and replay relay log to other lagging Slaves
  4. Master-slave switch operation

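    • Executes STOP SLAVE; RESET SLAVE ALL; on the latest Slave to clear its old replication settings (RESET MASTER would delete its binlogs and is not what is wanted here)
    • Requires log-bin and other master-required parameters to be enabled on candidate Slaves in advance, since my.cnf is not rewritten during failover
    • Promotes it to new Master by disabling read_only so that it accepts writes; CHANGE MASTER TO is run on the remaining Slaves, not on the new Master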
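  5. Rebuild replication topology

    -- Run on each remaining Slave; all values below are placeholders
    STOP SLAVE;
    CHANGE MASTER TO
      MASTER_HOST='new_master_ip',
      MASTER_AUTO_POSITION=1;  -- GTID mode; otherwise give MASTER_LOG_FILE and MASTER_LOG_POS
    START SLAVE;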
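
Step 2 above (locating the latest slave) can be sketched as a comparison of binlog coordinates. The sketch assumes the comparison is made on `Master_Log_File` and `Read_Master_Log_Pos` from SHOW SLAVE STATUS, that is, on how far each slave's I/O thread has read, and that binlog files end in a numeric sequence such as `mysql-bin.000012`. The host names and helper functions are hypothetical; this is not MHA's own code.

```python
from typing import Dict, List, Tuple

SlaveStatus = Dict[str, object]


def binlog_coords(status: SlaveStatus) -> Tuple[int, int]:
    """Turn one SHOW SLAVE STATUS row into a sortable (file sequence, position) pair."""
    file_name = str(status["Master_Log_File"])
    sequence = int(file_name.rsplit(".", 1)[1])  # "mysql-bin.000012" -> 12
    return sequence, int(status["Read_Master_Log_Pos"])


def pick_latest_slave(statuses: Dict[str, SlaveStatus]) -> str:
    """Return the host that has read the furthest into the master's binlog."""
    return max(statuses, key=lambda host: binlog_coords(statuses[host]))


def lagging_slaves(statuses: Dict[str, SlaveStatus]) -> List[str]:
    """Return the hosts that are behind the latest slave and need repair (step 3)."""
    best = binlog_coords(statuses[pick_latest_slave(statuses)])
    return sorted(h for h, s in statuses.items() if binlog_coords(s) < best)


# Hypothetical snapshot of three slaves after the master crashed.
statuses = {
    "slave1": {"Master_Log_File": "mysql-bin.000012", "Read_Master_Log_Pos": 4500},
    "slave2": {"Master_Log_File": "mysql-bin.000012", "Read_Master_Log_Pos": 9800},
    "slave3": {"Master_Log_File": "mysql-bin.000011", "Read_Master_Log_Pos": 99999},
}
latest = pick_latest_slave(statuses)
behind = lagging_slaves(statuses)
```

The file sequence is compared before the position, so a slave still reading an older binlog file counts as behind even when its position number is larger.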

Primary-Standby Switch

Primary-standby switch refers to the process of converting a standby database to primary and the original primary to standby.

Primary-Standby Switch Strategies

  1. Reliability-First Strategy: Ensures data consistency as highest priority

    • Typical implementation steps: a) Stop writes to primary b) Wait for standby to catch up with primary (Seconds_Behind_Master=0) c) Promote standby to new primary d) Enable writes on new primary
    • Applicable scenarios: Financial trading, order systems and other businesses with strict data consistency requirements
  2. Availability-First Strategy: Ensures service availability as highest priority

    • Typical implementation steps: a) Directly promote standby to new primary b) Allow new primary to immediately accept writes c) Original primary catches up with data
    • Applicable scenarios: Social networks, content platforms and other businesses prioritizing service continuity

Primary-Standby Lag

Primary-standby lag measures how far the standby is behind the primary in applying committed transactions.

Key time points:

  1. T1: Primary A completes transaction execution and writes to binlog
  2. T2: Standby B completely receives that binlog
  3. T3: Standby B executes the transaction in that binlog

Primary-standby lag calculation formula: Lag time = T3 - T1
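
As a worked example of the formula, the sketch below computes the lag from three timestamps and also splits it into transfer time (T2 - T1) and apply time (T3 - T2). The function name and the timestamps are made up for illustration.

```python
from datetime import datetime
from typing import Dict


def lag_breakdown(t1: datetime, t2: datetime, t3: datetime) -> Dict[str, float]:
    """Split primary-standby lag into its transfer and apply components, in seconds."""
    return {
        "transfer_seconds": (t2 - t1).total_seconds(),  # binlog shipped to the standby
        "apply_seconds": (t3 - t2).total_seconds(),     # standby executes the transaction
        "lag_seconds": (t3 - t1).total_seconds(),       # Lag time = T3 - T1
    }


# Hypothetical timestamps: the binlog arrives after 0.2 s but is applied 5 s after commit.
t1 = datetime(2024, 1, 1, 12, 0, 0)
t2 = datetime(2024, 1, 1, 12, 0, 0, 200000)
t3 = datetime(2024, 1, 1, 12, 0, 5)
example = lag_breakdown(t1, t2, t3)
```

In this example almost all of the lag comes from the apply phase (T3 - T2), not from network transfer (T2 - T1).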

Important fields:

  • Seconds_Behind_Master: Current standby lag in seconds, as reported by SHOW SLAVE STATUS
  • Relay_Log_Pos: Position in the standby's current relay log file up to which the SQL thread has executed
  • Master_Log_File: Master binlog file currently being read by the standby's I/O thread

Lag Causes

  1. Standby machine performance issues

    • Insufficient hardware configuration
    • Resource overload: One machine simultaneously serves as standby for multiple masters
    • Network bottleneck
  2. Workload division issues

    • Read operation load: Standby bears many read requests from applications
    • Background task interference
  3. Large transaction operations

    • Deleting a large amount of data in a single statement
    • Large table structure changes (such as adding indexes to tables with tens of millions of rows)
    • Batch data import

Reliability-First vs Availability-First

Reliability-First Process:

  • Check the seconds_behind_master value of Slave B and continue only if it is below an acceptable threshold
  • Change Primary A to read-only state (read_only=ON)
  • Wait for Slave B’s seconds_behind_master value to drop to 0
  • Change Slave B to writable state (read_only=OFF)
  • Switch business requests to Slave B
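
The reliability-first steps above can be sketched as a small in-memory simulation. Nothing here talks to a real MySQL server: the `Node` class, the `max_lag` threshold of 5 seconds, and the loop that pretends the standby catches up are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for a MySQL instance."""
    name: str
    read_only: bool
    seconds_behind_master: int = 0


def reliability_first_switch(primary: Node, standby: Node, max_lag: int = 5) -> Node:
    """Simulate a reliability-first switch and return the node that now takes writes."""
    # 1. Only start when the standby is reasonably close (the threshold is an assumption).
    if standby.seconds_behind_master > max_lag:
        raise RuntimeError("standby lag too high, retry later")
    # 2. Stop writes on the primary, so no new transactions can appear.
    primary.read_only = True
    # 3. Wait for the standby to catch up (simulated: each iteration applies 1 s of lag).
    while standby.seconds_behind_master > 0:
        standby.seconds_behind_master -= 1
    # 4. Open the standby for writes.
    standby.read_only = False
    # 5. The caller now routes business requests to the returned node.
    return standby


primary_a = Node("A", read_only=False)
standby_b = Node("B", read_only=True, seconds_behind_master=3)
new_primary = reliability_first_switch(primary_a, standby_b)
```

Between steps 2 and 4 both nodes are read-only, so the system cannot accept writes. That window is the price of the reliability-first strategy, and its length is determined by the lag at the moment of the switch.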

Availability-First:

  • Don’t wait for master-slave sync to complete, directly switch business requests to Slave B
  • Almost no downtime, but may cause data inconsistency because Slave B can accept new writes before it has applied all of Primary A's changes

In most cases, prefer the reliability-first strategy. Provided data reliability is guaranteed, MySQL’s availability depends on the size of the synchronization delay: the smaller the delay, the shorter the switch takes and the higher the availability.


Summary

Core advantages of MHA as MySQL high availability solution:

  1. Fast automatic failover: Completes automatically within 10-30 seconds
  2. Data consistency guarantee: Automatically identifies and applies latest binlog events
  3. Broad compatibility: Supports multiple replication modes, including standard master-slave and GTID-based replication
  4. Centralized monitoring management: Can manage multiple MySQL clusters

Limitations:

  • Requires at least one slave to work
  • Brief write interruption during failover
  • Requires additional Manager node for monitoring management

Currently MHA mainly supports one-master-multi-slave architecture. To build MHA, a replication cluster must have a minimum of three database servers.

Update Loss Problem

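  • With asynchronous replication, updates committed on the old master but not yet received by any slave can be lost at failover, and later writes on the new master may overwrite them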

MMM Architecture

Overview

MMM (Master-Master Replication Manager for MySQL) is an open-source MySQL high availability solution.

Core Functions

  1. Monitor MySQL master-slave replication status
  2. Automatically detect node failures and perform failover
  3. Manage virtual IP addresses for transparent client switching

Node Roles

  • Writer node: Handles all write operations
  • Reader node: Handles read requests

Fault Handling Process

  1. VIP removal
  2. Role switching
  3. Topology reconfiguration
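
The three steps above can be sketched on a toy data model. This is not MMM's implementation or configuration format: the VIP names, the host names, and the two dictionaries are hypothetical and only illustrate the order of operations.

```python
from typing import Dict, Tuple


def handle_writer_failure(
    vip_owner: Dict[str, str],
    replicates_from: Dict[str, str],
    failed: str,
    standby: str,
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return the new VIP assignment and replication sources after a writer failure."""
    # 1. VIP removal: take every VIP away from the failed node.
    orphaned = [vip for vip, host in vip_owner.items() if host == failed]
    new_owner = {vip: host for vip, host in vip_owner.items() if host != failed}
    # 2. Role switching: the standby takes over the orphaned VIPs (including the writer VIP).
    for vip in orphaned:
        new_owner[vip] = standby
    # 3. Topology reconfiguration: the remaining nodes replicate from the new writer.
    new_sources = {
        node: (standby if source == failed else source)
        for node, source in replicates_from.items()
        if node not in (failed, standby)
    }
    return new_owner, new_sources


vips = {"writer_vip": "db1", "reader_vip_1": "db2", "reader_vip_2": "db3"}
sources = {"db2": "db1", "db3": "db1"}
new_vips, new_sources = handle_writer_failure(vips, sources, failed="db1", standby="db2")
```

Because clients connect to the VIPs and not to physical hosts, moving `writer_vip` is what makes the switch transparent to them.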

Monitoring Mechanism

Monitor Server

  • Real-time collection of node health status indicators
  • Determines node faults through heartbeat detection

Agent

  • Runs on each MySQL server
  • Monitors local MySQL service status
  • Executes management instructions from Monitor