Distributed System Concepts

A distributed system is one whose hardware or software components are spread across networked computers, communicating and coordinating solely through message passing. Distributed services refer to deploying a system's functional modules across multiple server nodes, which communicate and collaborate over the network to achieve the system's overall goals.

Why Do We Need Distributed Services?

Bottlenecks of Monolithic Architecture

  • Single point of failure: if the single instance goes down, the entire system is unavailable
  • Limited performance: hard to scale horizontally; all logic is squeezed onto one machine
  • Slow iteration: even a small change requires redeploying and restarting the entire service
  • Messy dependencies: tightly coupled code, high maintenance cost

Advantages of Distributed Architecture

  • Horizontal scaling (expand by adding nodes)
  • Decouple modules, improve development efficiency
  • Improve system stability and maintainability
  • Support heterogeneous languages, technology stack development

Difference Between Distributed and Cluster

  • Cluster: multiple nodes running the same program as replicas (“multiple people doing the same thing together”)
  • Distributed: multiple nodes running different cooperating components (“multiple people doing different things together”)

Core Components

  • Service Provider (Provider): Implements business logic, provides services externally
  • Service Consumer (Consumer): Calls remote services, obtains required data
  • Registry (Registry): Stores service addresses, implements service discovery and subscription
  • Configuration Center (Config Center): Stores unified configuration, facilitates dynamic distribution and updates
  • Gateway/Reverse Proxy: Implements unified entry, security authentication, rate limiting, etc.
  • Monitoring/Link Tracking: Tracks service call chains, investigates bottlenecks and exceptions
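The provider/consumer/registry interaction above can be sketched as a toy in-memory registry. This is illustrative only: the `register`/`discover` names and the service/address values are made up, and real systems use a dedicated registry such as ZooKeeper, Nacos, Consul, or etcd.

```python
import threading
import time


class Registry:
    """Toy in-memory service registry: providers register their
    addresses, consumers look them up (service discovery)."""

    def __init__(self):
        self._services = {}          # service name -> {address: last registration time}
        self._lock = threading.Lock()

    def register(self, name, address):
        """Called by a provider to publish its address."""
        with self._lock:
            self._services.setdefault(name, {})[address] = time.monotonic()

    def discover(self, name):
        """Called by a consumer to find all known provider addresses."""
        with self._lock:
            return list(self._services.get(name, {}))


registry = Registry()
registry.register("order-service", "10.0.0.1:8080")   # provider 1
registry.register("order-service", "10.0.0.2:8080")   # provider 2
print(registry.discover("order-service"))
# ['10.0.0.1:8080', '10.0.0.2:8080']
```

A real registry additionally expires entries whose heartbeats stop arriving, which is why the registration timestamp is recorded here.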

Common Patterns

Microservices Architecture

  • Each service has single responsibility, deployed and scaled independently
  • Communication: HTTP REST, gRPC, message queues

SOA (Service-Oriented Architecture)

  • Heavier-weight than microservices; emphasizes a central Enterprise Service Bus (ESB)
  • Communication typically uses SOAP/Web Services

Communication Method Comparison

  • RESTful API: Based on HTTP, easy development, widely used
  • gRPC (Based on HTTP/2 + Protobuf): High performance, strongly typed
  • Message Queue (such as Kafka/RabbitMQ): Asynchronous decoupling, peak shaving
  • Thrift / Dubbo / Hessian: High-performance RPC frameworks
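The asynchronous decoupling and peak-shaving effect of a message queue can be illustrated with Python's in-process `queue.Queue` standing in for a real broker such as Kafka or RabbitMQ. This is a sketch of the pattern, not a broker: the producer returns as soon as the message is enqueued, while the consumer drains the queue at its own pace.

```python
import queue
import threading

q = queue.Queue(maxsize=100)   # bounded buffer absorbs traffic spikes
results = []


def consumer():
    """Drain the queue at the consumer's own pace."""
    while True:
        msg = q.get()
        if msg is None:         # sentinel: shut down
            break
        results.append(msg.upper())   # stand-in for real processing
        q.task_done()


t = threading.Thread(target=consumer)
t.start()

# The producer is never blocked waiting for processing to finish.
for i in range(5):
    q.put(f"event-{i}")
q.put(None)
t.join()
print(results)
# ['EVENT-0', 'EVENT-1', 'EVENT-2', 'EVENT-3', 'EVENT-4']
```

The bounded `maxsize` is what gives the peak-shaving behavior: during a burst, producers slow down instead of overwhelming the consumer.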

Problems Faced by Distributed Systems

Communication Exceptions

The network itself is unreliable, so every network communication carries the risk of becoming unavailable. Even when communication between nodes in a distributed system executes normally, its latency is far greater than that of a single-machine operation.

  1. Vulnerability of Network Infrastructure

    • Physical layer failures: Fiber breaks, switch failures, etc.
    • Routing anomalies: BGP routing leaks, etc.
    • DNS failures
  2. Network Latency Issues

    • Cross-data center communication limited by speed of light
    • Network congestion causing queuing delay
    • Fixed latency from TCP/TLS handshake
  3. Quantitative Comparison

    • Single-machine memory access: approximately 100ns
    • Same-room network latency: approximately 0.5-2ms
    • Cross-region latency: Beijing-Shanghai approximately 30ms, China-US approximately 150ms
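A quick calculation makes the scale of these gaps concrete. The figures are taken from the list above; the 1 ms same-room value is an assumed midpoint of the 0.5-2 ms range.

```python
# Latency figures from the comparison above, in nanoseconds.
LATENCIES_NS = {
    "memory access":     100,            # ~100 ns
    "same-room network": 1_000_000,      # ~1 ms (midpoint of 0.5-2 ms)
    "Beijing-Shanghai":  30_000_000,     # ~30 ms
    "China-US":          150_000_000,    # ~150 ms
}

base = LATENCIES_NS["memory access"]
for name, ns in LATENCIES_NS.items():
    # Express each latency as a multiple of a single memory access.
    print(f"{name:>20}: {ns / base:>12,.0f}x memory access")
```

A cross-region call costs over a million memory accesses, which is why chatty cross-data-center protocols dominate end-to-end response time.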

Network Partition

Network partition refers to the situation where nodes cannot communicate normally due to network failures in a distributed system.

Basic Characteristics

  1. Network Isolation: Different partitions completely lose network connection
  2. Internal Partition Normal: Nodes within each partition can still communicate with each other
  3. Coexistence of Multiple Partitions: Usually forms two or more independently operating subnetworks
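These characteristics mean the partitions are exactly the connected components of the "can still communicate" graph. A minimal sketch, where the node names and surviving links are hypothetical:

```python
def find_partitions(nodes, links):
    """Group nodes into network partitions: each partition is a
    connected component of the graph of still-working links."""
    adjacency = {n: set() for n in nodes}
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen, partitions = set(), []
    for start in nodes:
        if start in seen:
            continue
        # Depth-first search collects everything reachable from `start`.
        stack, component = [start], set()
        while stack:
            n = stack.pop()
            if n in component:
                continue
            component.add(n)
            stack.extend(adjacency[n] - component)
        seen |= component
        partitions.append(component)
    return partitions


# A failure severs the links between {A, B} and {C, D}: two partitions
# form, and nodes inside each one can still reach each other.
partitions = find_partitions(["A", "B", "C", "D"], [("A", "B"), ("C", "D")])
print(sorted(sorted(p) for p in partitions))
# [['A', 'B'], ['C', 'D']]
```

Real systems cannot compute this directly (no node sees the whole graph); they infer it from missed heartbeats, which is why quorum rules are needed to decide which partition may keep serving.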

Node Failure

Node failure is one of the most common problems in distributed systems: the servers that make up the system crash or deadlock.

Node failure manifestations include:

  1. Hardware Failure: Memory corruption, disk failure, power issues
  2. Software Failure: Operating system crash, application deadlock
  3. Network Problems: Node loses network connection with other cluster members

Three States

Unlike a local call, which either succeeds or fails, each request/response in a distributed system has three possible outcomes: success, failure, and timeout.

State Definitions

  1. Success: the request was received and processed, and the response was successfully returned to the requester
  2. Failure: the request was received but an error occurred during processing, and the system explicitly returned an error response
  3. Timeout: no response was received within the specified time, so the system cannot determine the request's final state
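A minimal sketch of how a caller might classify one call into the three states. Here `send_request` is a hypothetical stand-in for a real RPC invocation; the key point is that a timeout is distinct from an explicit failure.

```python
import enum


class CallState(enum.Enum):
    SUCCESS = "success"   # response received, no error
    FAILURE = "failure"   # explicit error response received
    TIMEOUT = "timeout"   # no response within the deadline: outcome unknown


def classify(send_request):
    """Map the outcome of one request/response into the three states."""
    try:
        send_request()
        return CallState.SUCCESS
    except TimeoutError:
        # Crucially, a timeout does NOT mean the request failed: the
        # receiver may have processed it and only the reply was lost.
        return CallState.TIMEOUT
    except Exception:
        return CallState.FAILURE


print(classify(lambda: None))   # CallState.SUCCESS
```

The `TimeoutError` branch must come before the generic handler; collapsing timeout into failure would wrongly assume the request had no effect.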

Timeout Cause Analysis

  1. Request Lost Timeout: due to network problems, the request was never delivered to the receiver
  2. Response Lost Timeout: the receiver received and processed the request successfully, but the response was lost on the way back

Common Solutions

  1. Retry mechanism (with backoff strategy)
  2. Transaction logs and state tracking
  3. Eventual consistency design
  4. Two-phase commit protocol
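As an example of solution 1, a retry loop with exponential backoff and jitter might look like the sketch below. The `rpc` callable and the flaky stub are illustrative; note that because a timed-out request may already have taken effect on the server, blind retries are only safe for idempotent operations.

```python
import random
import time


def call_with_retry(rpc, max_attempts=4, base_delay=0.1):
    """Retry a call on timeout, with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return rpc()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise               # out of attempts: surface the timeout
            # Exponential backoff (base, 2x, 4x, ...) plus random jitter
            # so that retries from many clients do not synchronize.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)


# Simulated flaky call that times out twice, then succeeds.
attempts = {"n": 0}

def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError
    return "ok"

print(call_with_retry(flaky, base_delay=0.01))   # ok (after two retries)
```

Combining this with a transaction log (solution 2) lets the server deduplicate retried requests, which is what makes retries safe for non-idempotent operations.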