Dual-Master Mode

Applicable Scenarios

MySQL dual-master mode is a common high-availability database architecture solution, particularly suitable for the following business scenarios:

1. Rapid Business Growth Phase

Many startup companies and small-to-medium enterprises initially adopt MySQL master-slave architecture, achieving read-write separation through one master (write database) and multiple slaves (read databases). As business scale expands, the limitations of this architecture become apparent:

  • Single master node becomes a system bottleneck, with all write operations concentrated on one node
  • When the master fails, manual intervention is needed to promote a slave to master; the switchover is complex and risks data loss
  • Peak periods may experience insufficient single master database performance

2. Systems with High Availability Requirements

Dual-master mode improves system availability through:

  • Two database nodes act as master to each other, forming bidirectional replication
  • When either node fails, the other node can immediately take over services
  • Automatic failover reduces manual intervention
  • Avoids single point of failure, improving overall system stability

3. Scenarios with Balanced Read-Write Pressure

Typical applications include:

  • E-commerce platform order systems (dual-master nodes can simultaneously handle order creation and status updates)
  • Social media user data services (users in different regions can write to the nearest node)
  • Financial trading system accounting (ensures high availability of transaction records)

Technical Implementation Details

The working mechanism of dual-master mode:

  1. Data synchronization: Both MySQL instances are configured as master and as each other’s slave
  2. Conflict handling: Avoid primary key conflicts by setting auto_increment_increment and auto_increment_offset
  3. Replication mode: Usually adopts row-based replication (RBR) to ensure data consistency
  4. Fault detection: works with Keepalived or MHA to achieve automatic failover
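
The settings above can be sketched in MySQL statements (8.0.23+ syntax; older versions use CHANGE MASTER TO; the host, user, and password are placeholders):

```sql
-- Sketch for Master A. Each node also needs a unique server_id, normally set in my.cnf.
SET GLOBAL auto_increment_increment = 2;  -- both masters step by 2
SET GLOBAL auto_increment_offset = 1;     -- Master A generates 1, 3, 5, ...

-- Point Master A at Master B as its replication source.
CHANGE REPLICATION SOURCE TO
  SOURCE_HOST = 'master-b.example.com',
  SOURCE_USER = 'repl',
  SOURCE_PASSWORD = '***',
  SOURCE_AUTO_POSITION = 1;               -- assumes GTID mode on both nodes
START REPLICA;

-- Master B mirrors this with auto_increment_offset = 2 and SOURCE_HOST pointing
-- back at Master A, completing the bidirectional replication loop.
```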

Notes

  • Ensure stable network connection to avoid replication lag
  • Recommend using the same version of MySQL server
  • Monitor replication status and handle replication errors promptly
  • Configure connection pool failover properly in the application

Dual-Master Dual-Write vs Dual-Master Single-Write?

It is recommended to use dual-master single-write because dual-master dual-write has the following issues:

ID Conflict Problem

When writing data to Master A, if the data has not yet synchronized to Master B, writing data to Master B will cause ID conflicts. Specific manifestations:

  • When using auto-increment IDs, two masters may generate the same ID values
  • Although this can be avoided by giving each master a different auto-increment offset (e.g., both masters use a step of 2, with Master A at offset 1 generating 1, 3, 5, 7 and Master B at offset 2 generating 2, 4, 6, 8), this approach brings:
    • Increased operational complexity
    • Difficult future expansion (adding a third writer means reconfiguring every node)
    • Non-consecutive primary key values, which waste ID space and can complicate range-based queries
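
A session-level illustration of the odd/even scheme (the table name is hypothetical; in production these variables are set globally, typically in my.cnf):

```sql
SET SESSION auto_increment_increment = 2;
SET SESSION auto_increment_offset = 1;          -- Master A's settings

CREATE TABLE demo (id INT AUTO_INCREMENT PRIMARY KEY, v VARCHAR(10));
INSERT INTO demo (v) VALUES ('a'), ('b'), ('c');
-- id values on Master A: 1, 3, 5
-- the same inserts with auto_increment_offset = 2 (Master B) yield: 2, 4, 6
```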

Update Loss Problem

When the same record is simultaneously updated on both masters:

  1. User A updates a record on Master1
  2. Almost simultaneously, User B updates the same record on Master2
  3. Because replication is asynchronous and bidirectional, each node applies the other's version after its own, so an earlier update is silently overwritten
  4. The result is that some update operations are lost, and the two masters can end up with different values for the same row
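
The sequence above can be sketched as a timeline (the table and columns are hypothetical):

```sql
-- Timeline sketch of the lost-update anomaly under bidirectional async replication.
-- t0  Master1: UPDATE account SET balance = 100 WHERE id = 1;  -- User A
-- t1  Master2: UPDATE account SET balance = 200 WHERE id = 1;  -- User B
-- t2  Master1's row change replicates to Master2 -> Master2 now shows 100
-- t3  Master2's row change replicates to Master1 -> Master1 now shows 200
-- Result: neither write saw the other before committing; one user's update is
-- lost on each node, and the two masters now disagree (200 vs 100).
```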

High Availability Architecture Recommendations

Recommended dual-master single-write architecture:

  1. Master-Slave Structure:

    • One Master acts as the primary database handling all write requests
    • The other Master serves as a hot standby for high availability failover
  2. Read-Write Separation:

    • Multiple Slave nodes are mounted under the primary database
    • Slave nodes handle all read requests
  3. Failover:

    • When the primary database fails, the standby automatically switches to the new primary
    • When the original primary recovers, it is automatically demoted to standby

This architecture ensures both high availability and avoids data consistency issues caused by dual-master dual-write.

As the business grows, the architecture typically evolves from master-slave to dual-master mode. Dual-master single-write is recommended, combined with high-availability components such as Keepalived or MMM to automate primary database failover.


MMM Architecture

Overview

MMM (Master-Master Replication Manager for MySQL) is an open-source MySQL database high-availability solution specifically designed for managing and monitoring MySQL dual-master replication architecture, with support for automatic primary node failover. Developed by MySQL community developers in Perl and first released in 2008, it is widely used in MySQL deployments requiring high availability.

MMM’s core functions include:

  1. Monitor MySQL master-slave replication status
  2. Automatically detect node failures and perform failover
  3. Manage virtual IP addresses for transparent client switching
  4. Provide node status monitoring and alerting

Although MMM uses a dual-master (Master-Master) architecture, during actual business operation only one node is in writable state (Active Master) at any given time, while the other node serves as hot standby (Passive Master). This design ensures high availability while avoiding potential data conflict issues from bidirectional replication. When the active primary node fails, MMM automatically switches write operations to the standby primary node while adjusting replication relationships to ensure database service continuity.

Typical Application Scenarios:

  • E-commerce website order databases
  • Financial trading systems
  • Enterprise applications requiring 24/7 high availability

Notes:

  • Requires MySQL replication functionality
  • Recommended for deployment in reliable network environments
  • May experience brief service interruptions during switching
  • Need to configure monitoring parameters properly to avoid false switching

MMM Fault Handling Mechanism

MMM manages the MySQL master-master replication architecture and automatically handles node failures to maintain high availability. It defines two main node roles:

1. Node Role Division

  • Writer node: Handles all write operations, typically only one active Writer node
  • Reader node: Handles read requests, can include multiple Slave nodes

2. Writer Node Fault Handling Process

When the Writer node (usually Master1) is detected as failing, MMM automatically executes the following failover process:

  1. VIP removal: Immediately removes the virtual IP (VIP) from the failed node to prevent applications from continuing to connect
  2. Role switching:
    • Automatically switches write operations to Master2 node
    • Promotes Master2 to the new Writer role
  3. Topology reconfiguration:
    • Reconfigures all Slave nodes to point to the new Master2
    • Updates replication relationships to ensure data consistency

3. Slave Node Management Mechanism

MMM not only manages primary nodes but also continuously monitors all Slave node status, handling the following exceptional situations:

  • Node down: Immediately removes that node’s VIP and marks it as unavailable
  • Replication lag: When lag exceeds the preset threshold, automatically removes the node from the read pool
  • Replication errors: After detecting replication errors, pauses that node’s service until the problem is resolved
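
The checks described above correspond to fields of the replication status output; a sketch of what the monitor inspects (SHOW REPLICA STATUS is MySQL 8.0.22+; earlier versions use SHOW SLAVE STATUS with the corresponding field names):

```sql
SHOW REPLICA STATUS\G
-- Fields a monitor typically evaluates:
--   Replica_IO_Running / Replica_SQL_Running : both 'Yes' means the replication
--                                              threads are healthy
--   Seconds_Behind_Source : replication lag; above the threshold, the node is
--                           removed from the read pool
--   Last_SQL_Error        : non-empty indicates a replication error, so the node
--                           is taken out of service until the error is resolved
```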

Fault Recovery Process:

  1. Continuously monitor failed node status
  2. When the node recovers:
    • Rejoin the replication topology
    • Restore VIP assignment
    • Rejoin the read pool to provide services

4. Typical Application Scenarios

  • E-commerce websites: Ensure uninterrupted order writing
  • Financial services: Guarantee transaction data consistency
  • Social platforms: Maintain high availability of read services

This mechanism maximizes reduction of manual intervention through automated fault detection and switching, ensuring high availability of database services.


MMM Monitoring Mechanism

1. System Architecture and Component Functions

MMM (Master-Master Replication Manager) monitoring system uses a master-slave architecture with two core components:

(1) Monitor Server
  • Core responsibilities: Acts as the monitoring system’s nerve center, continuously monitoring the entire MySQL cluster’s operational status
  • Deployment method: Recommended to independently deploy on a dedicated monitoring server, physically isolated from database servers
  • Main functions:
    • Real-time collection of node health status indicators (such as service availability, replication lag)
    • Determines node faults through heartbeat detection mechanism
    • Triggers failover process based on preset policies
    • Records fault events and switching logs
  • Typical configuration: Recommend deploying 2 Monitors in primary-backup architecture to avoid single point of failure
(2) Agent
  • Running location: Must be installed on each MySQL server instance
  • Working mode: Runs as a persistent process (usually in daemon mode)
  • Core functions:
    • Monitoring execution:
      • Regularly collects local MySQL service status (such as process status, port listening)
      • Checks master-slave replication status (Seconds_Behind_Master, etc.)
      • Monitors system resources (CPU, memory, disk space)
    • Command execution:
      • Receives and executes management instructions from Monitor
      • Completes VIP migration operations
      • Modifies MySQL read-only/read-write status
      • Controls replication thread start/stop
  • Communication mechanism: Maintains heartbeat communication with Monitor through encrypted channel (default port 9988)
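
The read-only/read-write switching the Agent performs maps to standard MySQL variables; a minimal sketch (super_read_only requires MySQL 5.7+):

```sql
-- Demote to passive/standby: block all writes, including from SUPER users.
-- Enabling super_read_only implicitly enables read_only as well.
SET GLOBAL super_read_only = ON;

-- Promote to active Writer: disabling read_only also clears super_read_only.
SET GLOBAL read_only = OFF;
```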

2. Typical Workflow Example

Taking primary node failover scenario as an example:

  1. Fault detection phase:

    • Agent sends heartbeat packet to Monitor every 10 seconds
    • Monitor misses 3 consecutive heartbeat responses from primary node (configurable)
    • Monitor verifies node status through backup channel (such as SSH)
  2. Switching decision phase:

    • Monitor confirms primary node is unreachable (exceeding threshold time such as 30 seconds)
    • Checks if slave replication lag is within allowable range (e.g., <30 seconds)
    • Selects the most suitable slave for promotion based on priority
  3. Switching execution phase:

    • Monitor sends promotion instruction to new primary node’s Agent
    • The new primary's Agent: (a) stops the replication threads, (b) disables read_only, (c) binds the VIP to the new primary
    • Monitor instructs the other slave nodes' Agents to: (a) change their replication source to point to the new primary, (b) restart their replication threads
  4. Status synchronization phase:

    • Each Agent reports execution results
    • Monitor updates cluster topology status
    • Records complete switching log
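
The switching execution phase can be sketched as MySQL 8.0.22+ statements (MMM's Agents perform the equivalent internally; the hostname is a placeholder, and VIP binding is an OS-level operation rather than SQL):

```sql
-- On the new primary:
STOP REPLICA;                    -- stop applying changes from the failed node
SET GLOBAL read_only = OFF;      -- allow writes (also clears super_read_only)

-- On each remaining slave, repoint replication at the new primary:
STOP REPLICA;
CHANGE REPLICATION SOURCE TO
  SOURCE_HOST = 'master2.example.com',
  SOURCE_AUTO_POSITION = 1;      -- assumes GTID-based replication
START REPLICA;
```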

3. Network Configuration Requirements

To ensure reliable monitoring system operation, the following network conditions must be met:

| Direction       | Protocol | Port | Purpose                                       |
|-----------------|----------|------|-----------------------------------------------|
| Monitor → Agent | TCP      | 9988 | Control command transmission                  |
| Agent → Monitor | TCP      | 9989 | Status reports                                |
| Agent ↔ Agent   | ICMP     | -    | Inter-node connectivity detection             |
| VIP network     | -        | -    | Requires a layer-2 reachable broadcast domain |

4. High Availability Safeguards

  1. Monitor redundancy:

    • Deploy 2 Monitor instances
    • Use keepalived to achieve VIP floating
    • Coordinate primary-backup roles through distributed locks
  2. Agent self-healing mechanism:

    • Automatic restart after heartbeat timeout
    • Local state cache prevents network partition misjudgment
    • Pre-check before critical operations
  3. Network reliability:

    • Recommend configuring independent management network
    • Important links use bonding multi-NIC binding
    • Set reasonable TCP retry parameters

MySQL 5.7 - LOGICAL_CLOCK Parallel Replication

  • The replica applies in parallel any transactions that committed together in a group commit on the source
  • Typical configuration:

slave_parallel_type=LOGICAL_CLOCK
binlog_group_commit_sync_delay=10000

MySQL 8.0 - Writeset-Based Parallel Replication

  • Uses binlog_transaction_dependency_tracking parameter
  • Supports WRITESET or WRITESET_SESSION mode
  • Automatically identifies row-level non-conflicting transactions
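
A sketch of enabling writeset tracking (variable availability differs across 8.0 releases; binlog_transaction_dependency_tracking is deprecated in the newest releases, where writeset behavior is built in):

```sql
-- Source (MySQL 8.0): track dependencies by writeset instead of commit order.
-- transaction_write_set_extraction must not be OFF (XXHASH64 is the 8.0 default).
SET GLOBAL binlog_transaction_dependency_tracking = 'WRITESET';

-- Replica: apply independent transactions with multiple workers
-- (requires the replication threads to be stopped before changing).
SET GLOBAL slave_parallel_type = 'LOGICAL_CLOCK';
SET GLOBAL slave_parallel_workers = 8;   -- tune to the workload
```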

Implementation Principles

  • Each transaction's writeset is computed by hashing the primary-key and unique-key values of the rows it modifies
  • Transactions whose writesets do not overlap are marked as independent, so the replica can apply them in parallel even if they did not commit together on the source, giving higher parallelism than LOGICAL_CLOCK alone

InnoDB Two-Phase Commit

  1. Prepare phase: InnoDB writes and flushes its redo log with the transaction in the prepared state
  2. Commit phase: the transaction is written to the binlog, then the InnoDB commit is finalized

Binlog Markers

  • last_committed: the sequence number of the latest transaction this one depends on
  • sequence_number: a monotonically increasing transaction number within the binlog; transactions sharing the same last_committed value can be applied in parallel on the replica

Performance Impact

  • Reported to reduce replication lag by 60%-80% in write-heavy workloads
  • Reported to improve replica CPU utilization by 30%-50%