Dual-Master Mode

Applicable Scenarios

MySQL dual-master mode is a common high-availability database architecture solution, particularly suitable for the following business scenarios:

1. Rapid Business Growth Phase

Many startup companies and small-to-medium enterprises initially adopt MySQL master-slave architecture, achieving read-write separation through one master (write database) and multiple slaves (read databases). As business scale expands, the limitations of this architecture become apparent:

  • Single master node becomes a system bottleneck, with all write operations concentrated on one node
  • When the master fails, manual intervention is needed to promote a slave to master; the switchover is complex and risks data loss
  • Peak periods may experience insufficient single master database performance

2. Systems with High Availability Requirements

Dual-master mode improves system availability through:

  • Two database nodes act as master to each other, forming bidirectional replication
  • When either node fails, the other node can immediately take over services
  • Automatic failover reduces manual intervention
  • Avoids single point of failure, improving overall system stability

3. Scenarios with Balanced Read-Write Pressure

Typical applications include:

  • E-commerce platform order systems (dual-master nodes can simultaneously handle order creation and status updates)
  • Social media user data services (users in different regions can write to the nearest node)
  • Financial trading system accounting (ensures high availability of transaction records)

Technical Implementation Details

The working mechanism of dual-master mode:

  1. Data synchronization: Both MySQL instances are configured as master and as each other’s slave
  2. Conflict handling: Avoid primary key conflicts by setting auto_increment_increment and auto_increment_offset
  3. Replication mode: Usually adopts row-based replication (RBR) to ensure data consistency
  4. Fault detection: works with Keepalived or MHA to achieve automatic failover
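
The settings above can be sketched in MySQL statements (8.0.23+ syntax; older versions use CHANGE MASTER TO; the host, user, and password are placeholders):

```sql
-- Sketch for Master A. Each node also needs a unique server_id, normally set in my.cnf.
SET GLOBAL auto_increment_increment = 2;  -- both masters step by 2
SET GLOBAL auto_increment_offset = 1;     -- Master A generates 1, 3, 5, ...

-- Point Master A at Master B as its replication source.
CHANGE REPLICATION SOURCE TO
  SOURCE_HOST = 'master-b.example.com',
  SOURCE_USER = 'repl',
  SOURCE_PASSWORD = '***',
  SOURCE_AUTO_POSITION = 1;               -- assumes GTID mode on both nodes
START REPLICA;

-- Master B mirrors this with auto_increment_offset = 2 and SOURCE_HOST pointing
-- back at Master A, completing the bidirectional replication loop.
```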

Notes

  • Ensure stable network connection to avoid replication lag
  • Recommend using the same version of MySQL server
  • Monitor replication status and handle replication errors promptly
  • Configure connection pool failover properly in the application

Dual-Master Dual-Write vs Dual-Master Single-Write?

It is recommended to use dual-master single-write because dual-master dual-write has the following issues:

ID Conflict Problem

When writing data to Master A, if the data has not yet synchronized to Master B, writing data to Master B will cause ID conflicts. Specific manifestations:

  • When using auto-increment IDs, two masters may generate the same ID values
  • Although this can be avoided by giving each master a different auto-increment offset (e.g., both masters use a step of 2, with Master A at offset 1 generating 1, 3, 5, 7 and Master B at offset 2 generating 2, 4, 6, 8), this approach brings:
    • Increased operational complexity
    • Difficult future expansion (adding a third writer means reconfiguring every node)
    • Non-consecutive primary key values, which waste ID space and can complicate range-based queries
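
A session-level illustration of the odd/even scheme (the table name is hypothetical; in production these variables are set globally, typically in my.cnf):

```sql
SET SESSION auto_increment_increment = 2;
SET SESSION auto_increment_offset = 1;          -- Master A's settings

CREATE TABLE demo (id INT AUTO_INCREMENT PRIMARY KEY, v VARCHAR(10));
INSERT INTO demo (v) VALUES ('a'), ('b'), ('c');
-- id values on Master A: 1, 3, 5
-- the same inserts with auto_increment_offset = 2 (Master B) yield: 2, 4, 6
```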

Update Loss Problem

When the same record is simultaneously updated on both masters:

  1. User A updates a record on Master1
  2. Almost simultaneously, User B updates the same record on Master2
  3. Because replication is asynchronous and bidirectional, each node applies the other's version after its own, so an earlier update is silently overwritten
  4. The result is that some update operations are lost, and the two masters can end up with different values for the same row
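
The sequence above can be sketched as a timeline (the table and columns are hypothetical):

```sql
-- Timeline sketch of the lost-update anomaly under bidirectional async replication.
-- t0  Master1: UPDATE account SET balance = 100 WHERE id = 1;  -- User A
-- t1  Master2: UPDATE account SET balance = 200 WHERE id = 1;  -- User B
-- t2  Master1's row change replicates to Master2 -> Master2 now shows 100
-- t3  Master2's row change replicates to Master1 -> Master1 now shows 200
-- Result: neither write saw the other before committing; one user's update is
-- lost on each node, and the two masters now disagree (200 vs 100).
```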

High Availability Architecture Recommendations

Recommended dual-master single-write architecture:

  1. Master-Slave Structure:

    • One Master acts as the primary database handling all write requests
    • The other Master serves as a hot standby for high availability failover
  2. Read-Write Separation:

    • Multiple Slave nodes are mounted under the primary database
    • Slave nodes handle all read requests
  3. Failover:

    • When the primary database fails, the standby automatically switches to the new primary
    • When the original primary recovers, it is automatically demoted to standby

This architecture ensures both high availability and avoids data consistency issues caused by dual-master dual-write.

As the business grows, the architecture typically evolves from master-slave to dual-master mode. Dual-master single-write is recommended, combined with high-availability components such as Keepalived or MMM to automate primary database failover.


MMM Architecture

Overview

MMM (Master-Master Replication Manager for MySQL) is an open-source MySQL database high-availability solution specifically designed for managing and monitoring MySQL dual-master replication architecture, with support for automatic primary node failover. Developed by MySQL community developers in Perl and first released in 2008, it is widely used in MySQL deployments requiring high availability.

MMM’s core functions include:

  1. Monitor MySQL master-slave replication status
  2. Automatically detect node failures and perform failover
  3. Manage virtual IP addresses for transparent client switching
  4. Provide node status monitoring and alerting

Although MMM uses a dual-master (Master-Master) architecture, during actual business operation only one node is in writable state (Active Master) at any given time, while the other node serves as hot standby (Passive Master). This design ensures high availability while avoiding potential data conflict issues from bidirectional replication. When the active primary node fails, MMM automatically switches write operations to the standby primary node while adjusting replication relationships to ensure database service continuity.

Typical Application Scenarios:

  • E-commerce website order databases
  • Financial trading systems
  • Enterprise applications requiring 24/7 high availability

Notes:

  • Requires MySQL replication functionality
  • Recommended for deployment in reliable network environments
  • May experience brief service interruptions during switching
  • Need to configure monitoring parameters properly to avoid false switching

MMM Fault Handling Mechanism

MMM manages the MySQL master-master replication architecture and automatically handles node failures to maintain high availability. It defines two main node roles:

1. Node Role Division

  • Writer node: Handles all write operations, typically only one active Writer node
  • Reader node: Handles read requests, can include multiple Slave nodes

2. Writer Node Fault Handling Process

When the Writer node (usually Master1) is detected as failing, MMM automatically executes the following failover process:

  1. VIP removal: Immediately removes the virtual IP (VIP) from the failed node to prevent applications from continuing to connect
  2. Role switching:
    • Automatically switches write operations to Master2 node
    • Promotes Master2 to the new Writer role
  3. Topology reconfiguration:
    • Reconfigures all Slave nodes to point to the new Master2
    • Updates replication relationships to ensure data consistency

3. Slave Node Management Mechanism

MMM not only manages primary nodes but also continuously monitors all Slave node status, handling the following exceptional situations:

  • Node down: Immediately removes that node’s VIP and marks it as unavailable
  • Replication lag: When lag exceeds the preset threshold, automatically removes the node from the read pool
  • Replication errors: After detecting replication errors, pauses that node’s service until the problem is resolved
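
The checks described above correspond to fields of the replication status output; a sketch of what the monitor inspects (SHOW REPLICA STATUS is MySQL 8.0.22+; earlier versions use SHOW SLAVE STATUS with the corresponding field names):

```sql
SHOW REPLICA STATUS\G
-- Fields a monitor typically evaluates:
--   Replica_IO_Running / Replica_SQL_Running : both 'Yes' means the replication
--                                              threads are healthy
--   Seconds_Behind_Source : replication lag; above the threshold, the node is
--                           removed from the read pool
--   Last_SQL_Error        : non-empty indicates a replication error, so the node
--                           is taken out of service until the error is resolved
```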

Fault Recovery Process:

  1. Continuously monitor failed node status
  2. When the node recovers:
    • Rejoin the replication topology
    • Restore VIP assignment
    • Rejoin the read pool to provide services

4. Typical Application Scenarios

  • E-commerce websites: Ensure uninterrupted order writing
  • Financial services: Guarantee transaction data consistency
  • Social platforms: Maintain high availability of read services

This mechanism maximizes reduction of manual intervention through automated fault detection and switching, ensuring high availability of database services.


MMM Monitoring Mechanism

1. System Architecture and Component Functions

MMM (Master-Master Replication Manager) monitoring system uses a master-slave architecture with two core components:

(1) Monitor Server
  • Core responsibilities: Acts as the monitoring system’s nerve center, continuously monitoring the entire MySQL cluster’s operational status
  • Deployment method: Recommended to independently deploy on a dedicated monitoring server, physically isolated from database servers
  • Main functions:
    • Real-time collection of node health status indicators (such as service availability, replication lag)
    • Determines node faults through heartbeat detection mechanism
    • Triggers failover process based on preset policies
    • Records fault events and switching logs
  • Typical configuration: Recommend deploying 2 Monitors in primary-backup architecture to avoid single point of failure
(2) Agent
  • Running location: Must be installed on each MySQL server instance
  • Working mode: Runs as a persistent process (usually in daemon mode)
  • Core functions:
    • Monitoring execution:
      • Regularly collects local MySQL service status (such as process status, port listening)
      • Checks master-slave replication status (Seconds_Behind_Master, etc.)
      • Monitors system resources (CPU, memory, disk space)
    • Command execution:
      • Receives and executes management instructions from Monitor
      • Completes VIP migration operations
      • Modifies MySQL read-only/read-write status
      • Controls replication thread start/stop
  • Communication mechanism: Maintains heartbeat communication with Monitor through encrypted channel (default port 9988)
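
The read-only/read-write switching the Agent performs maps to standard MySQL variables; a minimal sketch (super_read_only requires MySQL 5.7+):

```sql
-- Demote to passive/standby: block all writes, including from SUPER users.
-- Enabling super_read_only implicitly enables read_only as well.
SET GLOBAL super_read_only = ON;

-- Promote to active Writer: disabling read_only also clears super_read_only.
SET GLOBAL read_only = OFF;
```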

2. Typical Workflow Example

Taking primary node failover scenario as an example:

  1. Fault detection phase:

    • Agent sends heartbeat packet to Monitor every 10 seconds
    • Monitor misses 3 consecutive heartbeat responses from primary node (configurable)
    • Monitor verifies node status through backup channel (such as SSH)
  2. Switching decision phase:

    • Monitor confirms primary node is unreachable (exceeding threshold time such as 30 seconds)
    • Checks if slave replication lag is within allowable range (e.g., <30 seconds)
    • Selects the most suitable slave for promotion based on priority
  3. Switching execution phase:

    • Monitor sends promotion instruction to new primary node’s Agent
    • The new primary's Agent: (a) stops the replication threads, (b) disables read_only, (c) binds the VIP to the new primary
    • Monitor instructs the other slave nodes' Agents to: (a) change their replication source to point to the new primary, (b) restart their replication threads
  4. Status synchronization phase:

    • Each Agent reports execution results
    • Monitor updates cluster topology status
    • Records complete switching log
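
The switching execution phase can be sketched as MySQL 8.0.22+ statements (MMM's Agents perform the equivalent internally; the hostname is a placeholder, and VIP binding is an OS-level operation rather than SQL):

```sql
-- On the new primary:
STOP REPLICA;                    -- stop applying changes from the failed node
SET GLOBAL read_only = OFF;      -- allow writes (also clears super_read_only)

-- On each remaining slave, repoint replication at the new primary:
STOP REPLICA;
CHANGE REPLICATION SOURCE TO
  SOURCE_HOST = 'master2.example.com',
  SOURCE_AUTO_POSITION = 1;      -- assumes GTID-based replication
START REPLICA;
```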

3. Network Configuration Requirements

To ensure reliable monitoring system operation, the following network conditions must be met:

| Direction       | Protocol | Port | Purpose                                       |
|-----------------|----------|------|-----------------------------------------------|
| Monitor → Agent | TCP      | 9988 | Control command transmission                  |
| Agent → Monitor | TCP      | 9989 | Status reports                                |
| Agent ↔ Agent   | ICMP     | -    | Inter-node connectivity detection             |
| VIP network     | -        | -    | Requires a layer-2 reachable broadcast domain |

4. High Availability Safeguards

  1. Monitor redundancy:

    • Deploy 2 Monitor instances
    • Use keepalived to achieve VIP floating
    • Coordinate primary-backup roles through distributed locks
  2. Agent self-healing mechanism:

    • Automatic restart after heartbeat timeout
    • Local state cache prevents network partition misjudgment
    • Pre-check before critical operations
  3. Network reliability:

    • Recommend configuring independent management network
    • Important links use bonding multi-NIC binding
    • Set reasonable TCP retry parameters

MySQL 5.7 - LOGICAL_CLOCK Parallel Replication

  • The replica applies in parallel any transactions that committed together in a group commit on the source
  • Typical configuration:

slave_parallel_type=LOGICAL_CLOCK
binlog_group_commit_sync_delay=10000

MySQL 8.0 - Writeset-Based Parallel Replication

  • Uses binlog_transaction_dependency_tracking parameter
  • Supports WRITESET or WRITESET_SESSION mode
  • Automatically identifies row-level non-conflicting transactions
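
A sketch of enabling writeset tracking (variable availability differs across 8.0 releases; binlog_transaction_dependency_tracking is deprecated in the newest releases, where writeset behavior is built in):

```sql
-- Source (MySQL 8.0): track dependencies by writeset instead of commit order.
-- transaction_write_set_extraction must not be OFF (XXHASH64 is the 8.0 default).
SET GLOBAL binlog_transaction_dependency_tracking = 'WRITESET';

-- Replica: apply independent transactions with multiple workers
-- (requires the replication threads to be stopped before changing).
SET GLOBAL slave_parallel_type = 'LOGICAL_CLOCK';
SET GLOBAL slave_parallel_workers = 8;   -- tune to the workload
```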

Implementation Principles

  • Each transaction's writeset is computed by hashing the primary-key and unique-key values of the rows it modifies
  • Transactions whose writesets do not overlap are marked as independent, so the replica can apply them in parallel even if they did not commit together on the source, giving higher parallelism than LOGICAL_CLOCK alone

InnoDB Two-Phase Commit

  1. Prepare phase: InnoDB writes and flushes its redo log with the transaction in the prepared state
  2. Commit phase: the transaction is written to the binlog, then the InnoDB commit is finalized

Binlog Markers

  • last_committed: the sequence number of the latest transaction this one depends on
  • sequence_number: a monotonically increasing transaction number within the binlog; transactions sharing the same last_committed value can be applied in parallel on the replica

Performance Impact

  • Reported to reduce replication lag by 60%-80% in write-heavy workloads
  • Reported to improve replica CPU utilization by 30%-50%