
Cluster Architecture

The Monibuca V6 cluster solution is based on the Origin-Edge model. Nodes communicate over the QUIC protocol for low latency and high reliability, and the cluster supports automatic node discovery, intelligent stream forwarding, and load balancing.

                    ┌──────────────┐
                    │    DNS/LB    │ ← User Entry Point
                    └──────┬───────┘
            ┌──────────────┼───────────────┐
            │              │               │
     ┌──────▼─────┐  ┌─────▼──────┐  ┌─────▼──────┐
     │  Edge-BJ   │  │  Edge-SH   │  │  Edge-GZ   │
     │ (Beijing)  │  │ (Shanghai) │  │ (Guangzhou)│
     └──────┬─────┘  └─────┬──────┘  └─────┬──────┘
            │              │               │
            │          QUIC Relay          │
            │              │               │
     ┌──────▼──────────────▼───────────────▼──────┐
     │               Origin Cluster                │
     │  ┌──────────┐  ┌──────────┐  ┌──────────┐  │
     │  │ Origin-1 │  │ Origin-2 │  │ Origin-3 │  │
     │  └──────────┘  └──────────┘  └──────────┘  │
     └─────────────────────────────────────────────┘
| Role | Responsibility |
|------|----------------|
| Origin | Receives direct ingest from publishers and serves as the stream source; provides stream data to Edge nodes |
| Edge | Receives pull requests from viewers; automatically pulls from the Origin when a stream is not available locally |

Cluster nodes communicate using the QUIC protocol, leveraging its inherent advantages:

| Feature | TCP | QUIC |
|---------|-----|------|
| Connection establishment | 1-3 RTT (including TLS) | 0-1 RTT |
| Head-of-line blocking | Entire connection blocked | Only the individual stream blocked |
| Multiplexing | Requires HTTP/2 | Native support |
| Connection migration | Not supported | Supported (survives IP changes) |
| Congestion control | Shared across the connection | Independent per stream |
┌─────────────────────────────┐
│      Application Layer      │
│  (gRPC: Node Sync/Control)  │
├─────────────────────────────┤
│         Relay Layer         │
│ (Audio/Video Data Transfer) │
├─────────────────────────────┤
│       QUIC Transport        │
│  (quinn: Connection Mgmt)   │
├─────────────────────────────┤
│           TLS 1.3           │
│  (Self-signed/Auto Certs)   │
└─────────────────────────────┘
  • gRPC Layer: Node discovery, heartbeat detection, stream session registration, Catalog synchronization
  • Relay Layer: High-speed audio/video frame data transfer, transmitted directly over QUIC streams
  • Transport Layer: QUIC connection management based on the quinn library, with automatic certificate generation

When a new node starts, it connects to the seed nodes configured in seed_servers to obtain cluster topology information:

New Node → Seed Node: "I am edge-3, please tell me who is in the cluster"
Seed Node → New Node: [origin-1, edge-1, edge-2, ...]
New Node → Each Node: Establish QUIC connections
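
The join exchange above can be sketched as two plain messages plus a merge step on the joining node. The Rust sketch below is illustrative only: the type and function names (JoinRequest, KnownPeers, merge_and_dial) and their fields are assumptions, not the actual control-plane messages, which Monibuca V6 carries over the gRPC layer.

```rust
use std::collections::HashMap;
use std::net::SocketAddr;

/// Sent by a new node to a seed node (illustrative; not the actual wire format).
struct JoinRequest {
    server_id: String,
    address: SocketAddr,
}

/// The seed node replies with the cluster members it currently knows about.
struct KnownPeers {
    peers: HashMap<String, SocketAddr>,
}

/// Merge the reply into the local view; return peers that still need a QUIC connection.
fn merge_and_dial(local: &mut HashMap<String, SocketAddr>, reply: KnownPeers) -> Vec<SocketAddr> {
    let mut to_dial = Vec::new();
    for (id, addr) in reply.peers {
        if local.insert(id, addr).is_none() {
            to_dial.push(addr); // previously unknown peer: establish a QUIC connection to it
        }
    }
    to_dial
}

fn main() {
    // The new node announces itself to one of the seed_servers entries...
    let request = JoinRequest {
        server_id: "edge-3".into(),
        address: "10.0.2.3:8001".parse().unwrap(),
    };
    println!("join as {} at {}", request.server_id, request.address);

    // ...and receives the current topology in return.
    let reply = KnownPeers {
        peers: HashMap::from([
            ("origin-1".to_string(), "10.0.1.1:8001".parse().unwrap()),
            ("edge-1".to_string(), "10.0.2.1:8001".parse().unwrap()),
        ]),
    };
    let mut local = HashMap::new();
    println!("dial: {:?}", merge_and_dial(&mut local, reply));
}
```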
Once connected, nodes exchange periodic heartbeats to monitor each other:

┌────────┐    heartbeat (every 5s)    ┌────────┐
│ Node A │ ─────────────────────────► │ Node B │
│        │ ◄───────────────────────── │        │
│        │       heartbeat_ack        │        │
└────────┘                            └────────┘

Each heartbeat carries node summary information:

  • CPU usage
  • Memory usage
  • Bandwidth usage
  • Current stream count
  • Subscriber count
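
As a rough illustration, the per-heartbeat summary could be modeled like the struct below; the struct and field names are assumptions, not the actual message definition.

```rust
/// Node summary carried by each heartbeat (illustrative field names).
#[derive(Debug, Clone)]
struct Heartbeat {
    server_id: String,
    cpu_usage: f64,        // percent
    memory_usage: f64,     // percent
    bandwidth_mbps: f64,   // current outbound bandwidth in Mbps
    stream_count: u32,     // streams currently published/relayed on the node
    subscriber_count: u32, // subscribers currently served by the node
}

fn main() {
    // A node would fill this in every 5 seconds and send it to each peer.
    let hb = Heartbeat {
        server_id: "edge-bj-1".into(),
        cpu_usage: 42.0,
        memory_usage: 61.5,
        bandwidth_mbps: 950.0,
        stream_count: 120,
        subscriber_count: 1800,
    };
    println!("{hb:?}");
}
```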

Three-level failure detection based on heartbeat timeout:

| State | Condition | Behavior |
|-------|-----------|----------|
| Healthy | Heartbeats answered normally | Participates in the cluster normally |
| Suspect | suspect_threshold consecutive missed responses | Marked as suspect, weight reduced |
| Offline | offline_threshold consecutive missed responses | Marked as offline, sessions cleaned up |
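
Since the state depends only on how many consecutive heartbeat responses were missed, the classification can be sketched as a pure function. The function name and the example threshold values (3 and 6) below are assumptions; only the suspect_threshold and offline_threshold keys come from the table above.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum NodeState {
    Healthy,
    Suspect,
    Offline,
}

/// Classify a peer from the number of consecutive heartbeats it failed to answer.
fn classify(missed: u32, suspect_threshold: u32, offline_threshold: u32) -> NodeState {
    if missed >= offline_threshold {
        NodeState::Offline
    } else if missed >= suspect_threshold {
        NodeState::Suspect
    } else {
        NodeState::Healthy
    }
}

fn main() {
    // Example thresholds: suspect_threshold = 3, offline_threshold = 6.
    assert_eq!(classify(0, 3, 6), NodeState::Healthy);
    assert_eq!(classify(3, 3, 6), NodeState::Suspect); // weight reduced
    assert_eq!(classify(6, 3, 6), NodeState::Offline); // sessions cleaned up
    println!("state transitions ok");
}
```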

When a node is marked as Offline, the following actions are triggered:

  1. SessionRegistry clears all stream registrations for that node
  2. RelayManager disconnects all Relay connections to that node
  3. AllocationManager stops assigning requests to that node
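
The three cleanup steps could be coordinated by a single handler, as in the sketch below. The manager names come from the text above, but the structs and method names (clear_node, drop_relays_to, exclude) are hypothetical stand-ins, not the actual API.

```rust
// Hypothetical stand-ins for the managers named above; method names are assumptions.
struct SessionRegistry;
struct RelayManager;
struct AllocationManager;

impl SessionRegistry {
    fn clear_node(&mut self, id: &str) { println!("registry: dropped stream registrations of {id}"); }
}
impl RelayManager {
    fn drop_relays_to(&mut self, id: &str) { println!("relay: closed Relay connections to {id}"); }
}
impl AllocationManager {
    fn exclude(&mut self, id: &str) { println!("alloc: no longer assigning requests to {id}"); }
}

/// Runs when failure detection marks a node Offline (steps 1-3 above).
fn on_node_offline(id: &str, s: &mut SessionRegistry, r: &mut RelayManager, a: &mut AllocationManager) {
    s.clear_node(id);
    r.drop_relays_to(id);
    a.exclude(id);
}

fn main() {
    on_node_offline("origin-2", &mut SessionRegistry, &mut RelayManager, &mut AllocationManager);
}
```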
When a viewer requests a stream that the Edge does not hold locally, the Edge locates it through the SessionRegistry and pulls it from the Origin over a QUIC Relay:

Viewer → Edge: Request live/camera01
Edge: Stream not available locally, query SessionRegistry
SessionRegistry → Edge: Stream is on Origin-1
Edge → Origin-1: Establish QUIC Relay connection
Origin-1 → Edge: Transfer audio/video data via QUIC
Edge → Viewer: Distribute to local subscribers
The Relay lifecycle:

  1. Establishment: Edge detects no local stream and initiates a Relay request to the Origin via QUIC
  2. Transfer: RingBuffer data from Origin is transmitted to Edge over QUIC streams
  3. Health Check: Relay connection status is periodically checked (every health_check_interval seconds)
  4. Release: After the last subscriber on the Edge leaves, the Relay is released after waiting release_delay seconds
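
Step 4 (delayed release) boils down to comparing the time since the last subscriber left against release_delay. The sketch below shows only that check; the function name and parameters are illustrative, not the actual implementation.

```rust
use std::time::{Duration, Instant};

/// Decide whether an idle Relay should be torn down.
/// `last_subscriber_left` is None while the Edge still has subscribers.
fn should_release_relay(
    last_subscriber_left: Option<Instant>,
    release_delay: Duration,
    now: Instant,
) -> bool {
    match last_subscriber_left {
        Some(left_at) => now.duration_since(left_at) >= release_delay,
        None => false, // still has subscribers: keep the Relay
    }
}

fn main() {
    let left_at = Instant::now();
    let delay = Duration::from_secs(30); // e.g. release_delay: 30

    // Immediately after the last viewer leaves, the Relay is kept...
    assert!(!should_release_relay(Some(left_at), delay, left_at));
    // ...and released once release_delay has elapsed without a new subscriber.
    assert!(should_release_relay(Some(left_at), delay, left_at + delay));
    println!("release check ok");
}
```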

When a Relay connection is broken:

  1. RelayManager detects the connection anomaly
  2. Waits retry_delay seconds before retrying
  3. Retries up to max_retry_attempts times
  4. If the Origin node is offline, queries SessionRegistry for a new Origin
  5. After all retries fail, the stream on the Edge is marked as unavailable
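
A minimal sketch of this retry loop is shown below, with synchronous placeholder helpers (connect_relay, find_new_origin) standing in for the RelayManager and SessionRegistry calls; the names and the blocking style are simplifications for illustration, not the actual implementation.

```rust
use std::thread::sleep;
use std::time::Duration;

/// One reconnect attempt (stand-in: always fails in this sketch).
fn connect_relay(origin: &str) -> Result<(), String> {
    Err(format!("{origin} unreachable"))
}

/// Ask the SessionRegistry for another Origin holding the stream (stand-in).
fn find_new_origin(stream: &str, failed: &str) -> Option<String> {
    let _ = (stream, failed);
    None
}

/// Re-establish a broken Relay, mirroring steps 1-5 above.
fn recover_relay(stream: &str, mut origin: String, retry_delay: Duration, max_retry_attempts: u32) -> bool {
    for attempt in 1..=max_retry_attempts {
        sleep(retry_delay); // 2. wait retry_delay before retrying
        match connect_relay(&origin) {
            Ok(()) => return true, // Relay restored
            Err(err) => {
                eprintln!("relay to {origin} failed (attempt {attempt}): {err}");
                // 4. if the Origin looks gone, ask the SessionRegistry for a new one
                if let Some(next) = find_new_origin(stream, &origin) {
                    origin = next;
                }
            }
        }
    }
    false // 5. all retries failed: mark the stream unavailable on this Edge
}

fn main() {
    let ok = recover_relay("live/camera01", "origin-1".into(), Duration::from_millis(10), 3);
    println!("recovered: {ok}");
}
```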

AllocationManager selects the optimal node for new pull requests, considering:

  1. Node health status: Only selects nodes in Healthy state
  2. Load metrics: CPU usage, bandwidth usage, subscriber count
  3. Proximity: Prefers nodes in the same region as the request source
  4. Stream availability: Prefers nodes that already hold the target stream (to avoid pulling from origin)
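
One way to combine these criteria is to score each candidate and pick the lowest score, as sketched below; the weights, field names, and scoring formula are illustrative assumptions rather than the actual AllocationManager policy.

```rust
#[derive(Debug)]
struct Candidate {
    server_id: String,
    healthy: bool,        // 1. only Healthy nodes are eligible
    cpu_usage: f64,       // 2. load metrics (percent)
    bandwidth_usage: f64, // percent of configured capacity
    subscribers: u32,
    same_region: bool,    // 3. proximity to the request source
    has_stream: bool,     // 4. already holds the target stream
}

/// Lower score is better; None means the node must not be chosen.
fn score(c: &Candidate) -> Option<f64> {
    if !c.healthy {
        return None;
    }
    let mut s = c.cpu_usage + c.bandwidth_usage + c.subscribers as f64 / 100.0;
    if c.same_region { s -= 50.0; }  // prefer nearby nodes
    if c.has_stream { s -= 100.0; }  // prefer nodes that avoid an extra origin pull
    Some(s)
}

fn pick(candidates: &[Candidate]) -> Option<&Candidate> {
    candidates
        .iter()
        .filter_map(|c| score(c).map(|s| (s, c)))
        .min_by(|a, b| a.0.total_cmp(&b.0))
        .map(|(_, c)| c)
}

fn main() {
    let nodes = vec![
        Candidate { server_id: "edge-bj-1".into(), healthy: true, cpu_usage: 70.0, bandwidth_usage: 60.0, subscribers: 1500, same_region: true, has_stream: false },
        Candidate { server_id: "edge-sh-1".into(), healthy: true, cpu_usage: 30.0, bandwidth_usage: 20.0, subscribers: 300, same_region: false, has_stream: true },
        Candidate { server_id: "edge-gz-1".into(), healthy: false, cpu_usage: 10.0, bandwidth_usage: 10.0, subscribers: 50, same_region: true, has_stream: true },
    ];
    println!("chosen: {:?}", pick(&nodes).map(|c| &c.server_id));
}
```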

When an Edge node is overloaded, RedirectManager automatically redirects new requests to other nodes:

Viewer → Edge-BJ: Request live/camera01
Edge-BJ: CPU usage 90% > threshold 85%
Edge-BJ → Viewer: HTTP 302 → Edge-SH
Viewer → Edge-SH: Request live/camera01

Redirect threshold configuration:

cluster:
  routing:
    cpu_threshold: 85.0          # CPU usage threshold (%)
    bandwidth_threshold: 8000.0  # Bandwidth threshold (Mbps)
    subscriber_threshold: 2000   # Subscriber count threshold
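
With these thresholds in hand, the redirect decision reduces to comparing the current load against the configured limits, as in the sketch below; the struct and function names are illustrative, not the RedirectManager API, and peer selection is left out.

```rust
/// Mirrors the cluster.routing thresholds above.
struct RoutingThresholds {
    cpu_threshold: f64,       // percent
    bandwidth_threshold: f64, // Mbps
    subscriber_threshold: u32,
}

struct LocalLoad {
    cpu: f64,
    bandwidth_mbps: f64,
    subscribers: u32,
}

/// Returns the node to redirect to (HTTP 302) when the local node is overloaded,
/// or None to serve the request locally.
fn redirect_target<'a>(
    load: &LocalLoad,
    limits: &RoutingThresholds,
    best_peer: Option<&'a str>,
) -> Option<&'a str> {
    let overloaded = load.cpu > limits.cpu_threshold
        || load.bandwidth_mbps > limits.bandwidth_threshold
        || load.subscribers > limits.subscriber_threshold;
    if overloaded { best_peer } else { None }
}

fn main() {
    let limits = RoutingThresholds { cpu_threshold: 85.0, bandwidth_threshold: 8000.0, subscriber_threshold: 2000 };
    let load = LocalLoad { cpu: 90.0, bandwidth_mbps: 4200.0, subscribers: 1200 };
    // CPU 90% > 85%: redirect the new viewer to Edge-SH, as in the example above.
    println!("{:?}", redirect_target(&load, &limits, Some("edge-sh-1")));
}
```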

The most common deployment mode is a single Origin with multiple Edge nodes, suitable for small to medium scale:

# Origin configuration
cluster:
  sync:
    server_id: "origin-1"
    address: "10.0.1.1:8001"
    seed_servers: ["10.0.1.1:8001"]

# Edge configuration
cluster:
  sync:
    server_id: "edge-bj-1"
    address: "10.0.2.1:8001"
    seed_servers: ["10.0.1.1:8001"]  # Points to the Origin

Large-scale deployment with multiple Origins sharing the ingest load:

# Origin-1
cluster:
  sync:
    server_id: "origin-1"
    address: "10.0.1.1:8001"
    seed_servers: ["10.0.1.1:8001", "10.0.1.2:8001"]

# Origin-2
cluster:
  sync:
    server_id: "origin-2"
    address: "10.0.1.2:8001"
    seed_servers: ["10.0.1.1:8001", "10.0.1.2:8001"]

# Edge
cluster:
  sync:
    server_id: "edge-1"
    address: "10.0.2.1:8001"
    seed_servers: ["10.0.1.1:8001", "10.0.1.2:8001"]

Edge nodes can also serve as upstream for other Edges, forming a multi-tier cascade:

Ingest → Origin → Edge-L1 (Regional Hub) → Edge-L2 (City Node) → Viewer

Suitable for large-scale nationwide distribution scenarios.

Cluster state can be inspected and controlled through the management API:

GET /cluster/status        # cluster status overview
GET /cluster/nodes         # list of known nodes
GET /cluster/sessions      # stream session registry
POST /cluster/rebalance    # trigger a rebalance