Distributed System Concepts

A distributed system is one whose hardware or software components are spread across networked computers, communicating and coordinating solely through message passing. Distributed services refer to deploying a system's functional modules across multiple server nodes, which communicate and collaborate over the network to achieve the system's overall goals.

Why Do We Need Distributed Services?

Bottlenecks of Monolithic Architecture

  • Single point of failure: if the single instance goes down, the entire system is unavailable
  • Limited performance: hard to scale horizontally; all logic is squeezed onto one machine
  • Slow iteration: even a small change requires redeploying and restarting the entire service
  • Messy dependencies: tightly coupled code, high maintenance cost

Advantages of Distributed Architecture

  • Horizontal scaling (expand by adding nodes)
  • Decouple modules, improve development efficiency
  • Improve system stability and maintainability
  • Support heterogeneous languages, technology stack development

Difference Between Distributed and Cluster

  • Cluster: multiple nodes running the same program as replicas (“multiple people doing the same thing together”)
  • Distributed: multiple nodes running different cooperating components (“multiple people doing different things together”)

Core Components

  • Service Provider (Provider): Implements business logic, provides services externally
  • Service Consumer (Consumer): Calls remote services, obtains required data
  • Registry (Registry): Stores service addresses, implements service discovery and subscription
  • Configuration Center (Config Center): Stores unified configuration, facilitates dynamic distribution and updates
  • Gateway/Reverse Proxy: Implements unified entry, security authentication, rate limiting, etc.
  • Monitoring/Link Tracking: Tracks service call chains, investigates bottlenecks and exceptions
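The provider/consumer/registry interaction above can be sketched as a toy in-memory registry. This is illustrative only: the `register`/`discover` names and the service/address values are made up, and real systems use a dedicated registry such as ZooKeeper, Nacos, Consul, or etcd.

```python
import threading
import time


class Registry:
    """Toy in-memory service registry: providers register their
    addresses, consumers look them up (service discovery)."""

    def __init__(self):
        self._services = {}          # service name -> {address: last registration time}
        self._lock = threading.Lock()

    def register(self, name, address):
        """Called by a provider to publish its address."""
        with self._lock:
            self._services.setdefault(name, {})[address] = time.monotonic()

    def discover(self, name):
        """Called by a consumer to find all known provider addresses."""
        with self._lock:
            return list(self._services.get(name, {}))


registry = Registry()
registry.register("order-service", "10.0.0.1:8080")   # provider 1
registry.register("order-service", "10.0.0.2:8080")   # provider 2
print(registry.discover("order-service"))
# ['10.0.0.1:8080', '10.0.0.2:8080']
```

A real registry additionally expires entries whose heartbeats stop arriving, which is why the registration timestamp is recorded here.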

Common Patterns

Microservices Architecture

  • Each service has single responsibility, deployed and scaled independently
  • Communication: HTTP REST, gRPC, message queues

SOA (Service-Oriented Architecture)

  • Heavier-weight than microservices; emphasizes a central Enterprise Service Bus (ESB)
  • Communication typically uses SOAP/Web Services

Communication Method Comparison

  • RESTful API: Based on HTTP, easy development, widely used
  • gRPC (Based on HTTP/2 + Protobuf): High performance, strongly typed
  • Message Queue (such as Kafka/RabbitMQ): Asynchronous decoupling, peak shaving
  • Thrift / Dubbo / Hessian: High-performance RPC frameworks
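The asynchronous decoupling and peak-shaving effect of a message queue can be illustrated with Python's in-process `queue.Queue` standing in for a real broker such as Kafka or RabbitMQ. This is a sketch of the pattern, not a broker: the producer returns as soon as the message is enqueued, while the consumer drains the queue at its own pace.

```python
import queue
import threading

q = queue.Queue(maxsize=100)   # bounded buffer absorbs traffic spikes
results = []


def consumer():
    """Drain the queue at the consumer's own pace."""
    while True:
        msg = q.get()
        if msg is None:         # sentinel: shut down
            break
        results.append(msg.upper())   # stand-in for real processing
        q.task_done()


t = threading.Thread(target=consumer)
t.start()

# The producer is never blocked waiting for processing to finish.
for i in range(5):
    q.put(f"event-{i}")
q.put(None)
t.join()
print(results)
# ['EVENT-0', 'EVENT-1', 'EVENT-2', 'EVENT-3', 'EVENT-4']
```

The bounded `maxsize` is what gives the peak-shaving behavior: during a burst, producers slow down instead of overwhelming the consumer.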

Problems Faced by Distributed Systems

Communication Exceptions

The network itself is unreliable, so every network communication carries the risk of becoming unavailable. Even when communication between nodes in a distributed system executes normally, its latency is far greater than that of a single-machine operation.

  1. Vulnerability of Network Infrastructure

    • Physical layer failures: Fiber breaks, switch failures, etc.
    • Routing anomalies: BGP routing leaks, etc.
    • DNS failures
  2. Network Latency Issues

    • Cross-data center communication limited by speed of light
    • Network congestion causing queuing delay
    • Fixed latency from TCP/TLS handshake
  3. Quantitative Comparison

    • Single-machine memory access: approximately 100ns
    • Same-room network latency: approximately 0.5-2ms
    • Cross-region latency: Beijing-Shanghai approximately 30ms, China-US approximately 150ms
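A quick calculation makes the scale of these gaps concrete. The figures are taken from the list above; the 1 ms same-room value is an assumed midpoint of the 0.5-2 ms range.

```python
# Latency figures from the comparison above, in nanoseconds.
LATENCIES_NS = {
    "memory access":     100,            # ~100 ns
    "same-room network": 1_000_000,      # ~1 ms (midpoint of 0.5-2 ms)
    "Beijing-Shanghai":  30_000_000,     # ~30 ms
    "China-US":          150_000_000,    # ~150 ms
}

base = LATENCIES_NS["memory access"]
for name, ns in LATENCIES_NS.items():
    # Express each latency as a multiple of a single memory access.
    print(f"{name:>20}: {ns / base:>12,.0f}x memory access")
```

A cross-region call costs over a million memory accesses, which is why chatty cross-data-center protocols dominate end-to-end response time.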

Network Partition

Network partition refers to the situation where nodes cannot communicate normally due to network failures in a distributed system.

Basic Characteristics

  1. Network Isolation: Different partitions completely lose network connection
  2. Internal Partition Normal: Nodes within each partition can still communicate with each other
  3. Coexistence of Multiple Partitions: Usually forms two or more independently operating subnetworks
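These characteristics mean the partitions are exactly the connected components of the "can still communicate" graph. A minimal sketch, where the node names and surviving links are hypothetical:

```python
def find_partitions(nodes, links):
    """Group nodes into network partitions: each partition is a
    connected component of the graph of still-working links."""
    adjacency = {n: set() for n in nodes}
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen, partitions = set(), []
    for start in nodes:
        if start in seen:
            continue
        # Depth-first search collects everything reachable from `start`.
        stack, component = [start], set()
        while stack:
            n = stack.pop()
            if n in component:
                continue
            component.add(n)
            stack.extend(adjacency[n] - component)
        seen |= component
        partitions.append(component)
    return partitions


# A failure severs the links between {A, B} and {C, D}: two partitions
# form, and nodes inside each one can still reach each other.
partitions = find_partitions(["A", "B", "C", "D"], [("A", "B"), ("C", "D")])
print(sorted(sorted(p) for p in partitions))
# [['A', 'B'], ['C', 'D']]
```

Real systems cannot compute this directly (no node sees the whole graph); they infer it from missed heartbeats, which is why quorum rules are needed to decide which partition may keep serving.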

Node Failure

Node failure is one of the most common problems in distributed systems: the servers that make up the system crash or deadlock.

Node failure manifestations include:

  1. Hardware Failure: Memory corruption, disk failure, power issues
  2. Software Failure: Operating system crash, application deadlock
  3. Network Problems: Node loses network connection with other cluster members

Three States

Unlike a local call, which either succeeds or fails, each request/response in a distributed system has three possible outcomes: success, failure, and timeout.

State Definitions

  1. Success: the request was received and processed, and the response was successfully returned to the requester
  2. Failure: the request was received but an error occurred during processing, and the system explicitly returned an error response
  3. Timeout: no response was received within the specified time, so the system cannot determine the request's final state
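A minimal sketch of how a caller might classify one call into the three states. Here `send_request` is a hypothetical stand-in for a real RPC invocation; the key point is that a timeout is distinct from an explicit failure.

```python
import enum


class CallState(enum.Enum):
    SUCCESS = "success"   # response received, no error
    FAILURE = "failure"   # explicit error response received
    TIMEOUT = "timeout"   # no response within the deadline: outcome unknown


def classify(send_request):
    """Map the outcome of one request/response into the three states."""
    try:
        send_request()
        return CallState.SUCCESS
    except TimeoutError:
        # Crucially, a timeout does NOT mean the request failed: the
        # receiver may have processed it and only the reply was lost.
        return CallState.TIMEOUT
    except Exception:
        return CallState.FAILURE


print(classify(lambda: None))   # CallState.SUCCESS
```

The `TimeoutError` branch must come before the generic handler; collapsing timeout into failure would wrongly assume the request had no effect.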

Timeout Cause Analysis

  1. Request Lost Timeout: due to network problems, the request was never delivered to the receiver
  2. Response Lost Timeout: the receiver received and processed the request successfully, but the response was lost on the way back

Common Solutions

  1. Retry mechanism (with backoff strategy)
  2. Transaction logs and state tracking
  3. Eventual consistency design
  4. Two-phase commit protocol
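As an example of solution 1, a retry loop with exponential backoff and jitter might look like the sketch below. The `rpc` callable and the flaky stub are illustrative; note that because a timed-out request may already have taken effect on the server, blind retries are only safe for idempotent operations.

```python
import random
import time


def call_with_retry(rpc, max_attempts=4, base_delay=0.1):
    """Retry a call on timeout, with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return rpc()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise               # out of attempts: surface the timeout
            # Exponential backoff (base, 2x, 4x, ...) plus random jitter
            # so that retries from many clients do not synchronize.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)


# Simulated flaky call that times out twice, then succeeds.
attempts = {"n": 0}

def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError
    return "ok"

print(call_with_retry(flaky, base_delay=0.01))   # ok (after two retries)
```

Combining this with a transaction log (solution 2) lets the server deduplicate retried requests, which is what makes retries safe for non-idempotent operations.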