
Cluster Architecture

The Monibuca V6 cluster solution is based on the Origin-Edge model. Nodes communicate over the QUIC protocol for low latency and high reliability, and the cluster supports automatic node discovery, intelligent stream forwarding, and load balancing.

                    ┌──────────────┐
                    │    DNS/LB    │ ← User Entry Point
                    └──────┬───────┘
            ┌──────────────┼───────────────┐
            │              │               │
     ┌──────▼─────┐  ┌─────▼──────┐  ┌─────▼──────┐
     │  Edge-BJ   │  │  Edge-SH   │  │  Edge-GZ   │
     │ (Beijing)  │  │ (Shanghai) │  │ (Guangzhou)│
     └──────┬─────┘  └─────┬──────┘  └─────┬──────┘
            │              │               │
            │          QUIC Relay          │
            │              │               │
     ┌──────▼──────────────▼───────────────▼──────┐
     │               Origin Cluster                │
     │  ┌──────────┐  ┌──────────┐  ┌──────────┐  │
     │  │ Origin-1 │  │ Origin-2 │  │ Origin-3 │  │
     │  └──────────┘  └──────────┘  └──────────┘  │
     └─────────────────────────────────────────────┘
| Role | Responsibility |
|------|----------------|
| Origin | Receives direct ingest from publishers and serves as the stream source; provides stream data to Edge nodes |
| Edge | Receives pull requests from viewers; automatically pulls from the Origin when a stream is not available locally |

Cluster nodes communicate using the QUIC protocol, leveraging its inherent advantages:

| Feature | TCP | QUIC |
|---------|-----|------|
| Connection establishment | 1-3 RTT (including TLS) | 0-1 RTT |
| Head-of-line blocking | Entire connection blocked | Only the individual stream blocked |
| Multiplexing | Requires HTTP/2 | Native support |
| Connection migration | Not supported | Supported (survives IP changes) |
| Congestion control | Shared across the connection | Independent per stream |
┌─────────────────────────────┐
│      Application Layer      │
│  (gRPC: Node Sync/Control)  │
├─────────────────────────────┤
│         Relay Layer         │
│ (Audio/Video Data Transfer) │
├─────────────────────────────┤
│       QUIC Transport        │
│  (quinn: Connection Mgmt)   │
├─────────────────────────────┤
│           TLS 1.3           │
│  (Self-signed/Auto Certs)   │
└─────────────────────────────┘
  • gRPC Layer: Node discovery, heartbeat detection, stream session registration, Catalog synchronization
  • Relay Layer: High-speed audio/video frame data transfer, transmitted directly over QUIC streams
  • Transport Layer: QUIC connection management based on the quinn library, with automatic certificate generation

When a new node starts, it connects to the seed nodes configured in seed_servers to obtain cluster topology information:

New Node → Seed Node: "I am edge-3, please tell me who is in the cluster"
Seed Node → New Node: [origin-1, edge-1, edge-2, ...]
New Node → Each Node: Establish QUIC connections
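
The join exchange above can be sketched as two plain messages plus a merge step on the joining node. The Rust sketch below is illustrative only: the type and function names (JoinRequest, KnownPeers, merge_and_dial) and their fields are assumptions, not the actual control-plane messages, which Monibuca V6 carries over the gRPC layer.

```rust
use std::collections::HashMap;
use std::net::SocketAddr;

/// Sent by a new node to a seed node (illustrative; not the actual wire format).
struct JoinRequest {
    server_id: String,
    address: SocketAddr,
}

/// The seed node replies with the cluster members it currently knows about.
struct KnownPeers {
    peers: HashMap<String, SocketAddr>,
}

/// Merge the reply into the local view; return peers that still need a QUIC connection.
fn merge_and_dial(local: &mut HashMap<String, SocketAddr>, reply: KnownPeers) -> Vec<SocketAddr> {
    let mut to_dial = Vec::new();
    for (id, addr) in reply.peers {
        if local.insert(id, addr).is_none() {
            to_dial.push(addr); // previously unknown peer: establish a QUIC connection to it
        }
    }
    to_dial
}

fn main() {
    // The new node announces itself to one of the seed_servers entries...
    let request = JoinRequest {
        server_id: "edge-3".into(),
        address: "10.0.2.3:8001".parse().unwrap(),
    };
    println!("join as {} at {}", request.server_id, request.address);

    // ...and receives the current topology in return.
    let reply = KnownPeers {
        peers: HashMap::from([
            ("origin-1".to_string(), "10.0.1.1:8001".parse().unwrap()),
            ("edge-1".to_string(), "10.0.2.1:8001".parse().unwrap()),
        ]),
    };
    let mut local = HashMap::new();
    println!("dial: {:?}", merge_and_dial(&mut local, reply));
}
```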
Once connected, nodes exchange periodic heartbeats to monitor each other:

┌────────┐    heartbeat (every 5s)    ┌────────┐
│ Node A │ ─────────────────────────► │ Node B │
│        │ ◄───────────────────────── │        │
│        │       heartbeat_ack        │        │
└────────┘                            └────────┘

Each heartbeat carries node summary information:

  • CPU usage
  • Memory usage
  • Bandwidth usage
  • Current stream count
  • Subscriber count
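
As a rough illustration, the per-heartbeat summary could be modeled like the struct below; the struct and field names are assumptions, not the actual message definition.

```rust
/// Node summary carried by each heartbeat (illustrative field names).
#[derive(Debug, Clone)]
struct Heartbeat {
    server_id: String,
    cpu_usage: f64,        // percent
    memory_usage: f64,     // percent
    bandwidth_mbps: f64,   // current outbound bandwidth in Mbps
    stream_count: u32,     // streams currently published/relayed on the node
    subscriber_count: u32, // subscribers currently served by the node
}

fn main() {
    // A node would fill this in every 5 seconds and send it to each peer.
    let hb = Heartbeat {
        server_id: "edge-bj-1".into(),
        cpu_usage: 42.0,
        memory_usage: 61.5,
        bandwidth_mbps: 950.0,
        stream_count: 120,
        subscriber_count: 1800,
    };
    println!("{hb:?}");
}
```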

Three-level failure detection based on heartbeat timeout:

| State | Condition | Behavior |
|-------|-----------|----------|
| Healthy | Heartbeats answered normally | Participates in the cluster normally |
| Suspect | suspect_threshold consecutive missed responses | Marked as suspect, weight reduced |
| Offline | offline_threshold consecutive missed responses | Marked as offline, sessions cleaned up |
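
Since the state depends only on how many consecutive heartbeat responses were missed, the classification can be sketched as a pure function. The function name and the example threshold values (3 and 6) below are assumptions; only the suspect_threshold and offline_threshold keys come from the table above.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum NodeState {
    Healthy,
    Suspect,
    Offline,
}

/// Classify a peer from the number of consecutive heartbeats it failed to answer.
fn classify(missed: u32, suspect_threshold: u32, offline_threshold: u32) -> NodeState {
    if missed >= offline_threshold {
        NodeState::Offline
    } else if missed >= suspect_threshold {
        NodeState::Suspect
    } else {
        NodeState::Healthy
    }
}

fn main() {
    // Example thresholds: suspect_threshold = 3, offline_threshold = 6.
    assert_eq!(classify(0, 3, 6), NodeState::Healthy);
    assert_eq!(classify(3, 3, 6), NodeState::Suspect); // weight reduced
    assert_eq!(classify(6, 3, 6), NodeState::Offline); // sessions cleaned up
    println!("state transitions ok");
}
```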

When a node is marked as Offline, the following actions are triggered:

  1. SessionRegistry clears all stream registrations for that node
  2. RelayManager disconnects all Relay connections to that node
  3. AllocationManager stops assigning requests to that node
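
The three cleanup steps could be coordinated by a single handler, as in the sketch below. The manager names come from the text above, but the structs and method names (clear_node, drop_relays_to, exclude) are hypothetical stand-ins, not the actual API.

```rust
// Hypothetical stand-ins for the managers named above; method names are assumptions.
struct SessionRegistry;
struct RelayManager;
struct AllocationManager;

impl SessionRegistry {
    fn clear_node(&mut self, id: &str) { println!("registry: dropped stream registrations of {id}"); }
}
impl RelayManager {
    fn drop_relays_to(&mut self, id: &str) { println!("relay: closed Relay connections to {id}"); }
}
impl AllocationManager {
    fn exclude(&mut self, id: &str) { println!("alloc: no longer assigning requests to {id}"); }
}

/// Runs when failure detection marks a node Offline (steps 1-3 above).
fn on_node_offline(id: &str, s: &mut SessionRegistry, r: &mut RelayManager, a: &mut AllocationManager) {
    s.clear_node(id);
    r.drop_relays_to(id);
    a.exclude(id);
}

fn main() {
    on_node_offline("origin-2", &mut SessionRegistry, &mut RelayManager, &mut AllocationManager);
}
```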
When a viewer requests a stream that the Edge does not hold locally, the Edge locates it through the SessionRegistry and pulls it from the Origin over a QUIC Relay:

Viewer → Edge: Request live/camera01
Edge: Stream not available locally, query SessionRegistry
SessionRegistry → Edge: Stream is on Origin-1
Edge → Origin-1: Establish QUIC Relay connection
Origin-1 → Edge: Transfer audio/video data via QUIC
Edge → Viewer: Distribute to local subscribers
The Relay lifecycle:

  1. Establishment: Edge detects no local stream and initiates a Relay request to the Origin via QUIC
  2. Transfer: RingBuffer data from Origin is transmitted to Edge over QUIC streams
  3. Health Check: Relay connection status is periodically checked (every health_check_interval seconds)
  4. Release: After the last subscriber on the Edge leaves, the Relay is released after waiting release_delay seconds
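
Step 4 (delayed release) boils down to comparing the time since the last subscriber left against release_delay. The sketch below shows only that check; the function name and parameters are illustrative, not the actual implementation.

```rust
use std::time::{Duration, Instant};

/// Decide whether an idle Relay should be torn down.
/// `last_subscriber_left` is None while the Edge still has subscribers.
fn should_release_relay(
    last_subscriber_left: Option<Instant>,
    release_delay: Duration,
    now: Instant,
) -> bool {
    match last_subscriber_left {
        Some(left_at) => now.duration_since(left_at) >= release_delay,
        None => false, // still has subscribers: keep the Relay
    }
}

fn main() {
    let left_at = Instant::now();
    let delay = Duration::from_secs(30); // e.g. release_delay: 30

    // Immediately after the last viewer leaves, the Relay is kept...
    assert!(!should_release_relay(Some(left_at), delay, left_at));
    // ...and released once release_delay has elapsed without a new subscriber.
    assert!(should_release_relay(Some(left_at), delay, left_at + delay));
    println!("release check ok");
}
```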

When a Relay connection is broken:

  1. RelayManager detects the connection anomaly
  2. Waits retry_delay seconds before retrying
  3. Retries up to max_retry_attempts times
  4. If the Origin node is offline, queries SessionRegistry for a new Origin
  5. After all retries fail, the stream on the Edge is marked as unavailable
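
A minimal sketch of this retry loop is shown below, with synchronous placeholder helpers (connect_relay, find_new_origin) standing in for the RelayManager and SessionRegistry calls; the names and the blocking style are simplifications for illustration, not the actual implementation.

```rust
use std::thread::sleep;
use std::time::Duration;

/// One reconnect attempt (stand-in: always fails in this sketch).
fn connect_relay(origin: &str) -> Result<(), String> {
    Err(format!("{origin} unreachable"))
}

/// Ask the SessionRegistry for another Origin holding the stream (stand-in).
fn find_new_origin(stream: &str, failed: &str) -> Option<String> {
    let _ = (stream, failed);
    None
}

/// Re-establish a broken Relay, mirroring steps 1-5 above.
fn recover_relay(stream: &str, mut origin: String, retry_delay: Duration, max_retry_attempts: u32) -> bool {
    for attempt in 1..=max_retry_attempts {
        sleep(retry_delay); // 2. wait retry_delay before retrying
        match connect_relay(&origin) {
            Ok(()) => return true, // Relay restored
            Err(err) => {
                eprintln!("relay to {origin} failed (attempt {attempt}): {err}");
                // 4. if the Origin looks gone, ask the SessionRegistry for a new one
                if let Some(next) = find_new_origin(stream, &origin) {
                    origin = next;
                }
            }
        }
    }
    false // 5. all retries failed: mark the stream unavailable on this Edge
}

fn main() {
    let ok = recover_relay("live/camera01", "origin-1".into(), Duration::from_millis(10), 3);
    println!("recovered: {ok}");
}
```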

AllocationManager selects the optimal node for new pull requests, considering:

  1. Node health status: Only selects nodes in Healthy state
  2. Load metrics: CPU usage, bandwidth usage, subscriber count
  3. Proximity: Prefers nodes in the same region as the request source
  4. Stream availability: Prefers nodes that already hold the target stream (to avoid pulling from origin)
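
One way to combine these criteria is to score each candidate and pick the lowest score, as sketched below; the weights, field names, and scoring formula are illustrative assumptions rather than the actual AllocationManager policy.

```rust
#[derive(Debug)]
struct Candidate {
    server_id: String,
    healthy: bool,        // 1. only Healthy nodes are eligible
    cpu_usage: f64,       // 2. load metrics (percent)
    bandwidth_usage: f64, // percent of configured capacity
    subscribers: u32,
    same_region: bool,    // 3. proximity to the request source
    has_stream: bool,     // 4. already holds the target stream
}

/// Lower score is better; None means the node must not be chosen.
fn score(c: &Candidate) -> Option<f64> {
    if !c.healthy {
        return None;
    }
    let mut s = c.cpu_usage + c.bandwidth_usage + c.subscribers as f64 / 100.0;
    if c.same_region { s -= 50.0; }  // prefer nearby nodes
    if c.has_stream { s -= 100.0; }  // prefer nodes that avoid an extra origin pull
    Some(s)
}

fn pick(candidates: &[Candidate]) -> Option<&Candidate> {
    candidates
        .iter()
        .filter_map(|c| score(c).map(|s| (s, c)))
        .min_by(|a, b| a.0.total_cmp(&b.0))
        .map(|(_, c)| c)
}

fn main() {
    let nodes = vec![
        Candidate { server_id: "edge-bj-1".into(), healthy: true, cpu_usage: 70.0, bandwidth_usage: 60.0, subscribers: 1500, same_region: true, has_stream: false },
        Candidate { server_id: "edge-sh-1".into(), healthy: true, cpu_usage: 30.0, bandwidth_usage: 20.0, subscribers: 300, same_region: false, has_stream: true },
        Candidate { server_id: "edge-gz-1".into(), healthy: false, cpu_usage: 10.0, bandwidth_usage: 10.0, subscribers: 50, same_region: true, has_stream: true },
    ];
    println!("chosen: {:?}", pick(&nodes).map(|c| &c.server_id));
}
```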

When an Edge node is overloaded, RedirectManager automatically redirects new requests to other nodes:

Viewer → Edge-BJ: Request live/camera01
Edge-BJ: CPU usage 90% > threshold 85%
Edge-BJ → Viewer: HTTP 302 → Edge-SH
Viewer → Edge-SH: Request live/camera01

Redirect threshold configuration:

cluster:
  routing:
    cpu_threshold: 85.0          # CPU usage threshold (%)
    bandwidth_threshold: 8000.0  # Bandwidth threshold (Mbps)
    subscriber_threshold: 2000   # Subscriber count threshold
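
With these thresholds in hand, the redirect decision reduces to comparing the current load against the configured limits, as in the sketch below; the struct and function names are illustrative, not the RedirectManager API, and peer selection is left out.

```rust
/// Mirrors the cluster.routing thresholds above.
struct RoutingThresholds {
    cpu_threshold: f64,       // percent
    bandwidth_threshold: f64, // Mbps
    subscriber_threshold: u32,
}

struct LocalLoad {
    cpu: f64,
    bandwidth_mbps: f64,
    subscribers: u32,
}

/// Returns the node to redirect to (HTTP 302) when the local node is overloaded,
/// or None to serve the request locally.
fn redirect_target<'a>(
    load: &LocalLoad,
    limits: &RoutingThresholds,
    best_peer: Option<&'a str>,
) -> Option<&'a str> {
    let overloaded = load.cpu > limits.cpu_threshold
        || load.bandwidth_mbps > limits.bandwidth_threshold
        || load.subscribers > limits.subscriber_threshold;
    if overloaded { best_peer } else { None }
}

fn main() {
    let limits = RoutingThresholds { cpu_threshold: 85.0, bandwidth_threshold: 8000.0, subscriber_threshold: 2000 };
    let load = LocalLoad { cpu: 90.0, bandwidth_mbps: 4200.0, subscribers: 1200 };
    // CPU 90% > 85%: redirect the new viewer to Edge-SH, as in the example above.
    println!("{:?}", redirect_target(&load, &limits, Some("edge-sh-1")));
}
```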

The most common deployment mode is a single Origin with multiple Edge nodes, suitable for small to medium scale:

# Origin configuration
cluster:
  sync:
    server_id: "origin-1"
    address: "10.0.1.1:8001"
    seed_servers: ["10.0.1.1:8001"]

# Edge configuration
cluster:
  sync:
    server_id: "edge-bj-1"
    address: "10.0.2.1:8001"
    seed_servers: ["10.0.1.1:8001"]  # Points to the Origin

Large-scale deployment with multiple Origins sharing the ingest load:

# Origin-1
cluster:
  sync:
    server_id: "origin-1"
    address: "10.0.1.1:8001"
    seed_servers: ["10.0.1.1:8001", "10.0.1.2:8001"]

# Origin-2
cluster:
  sync:
    server_id: "origin-2"
    address: "10.0.1.2:8001"
    seed_servers: ["10.0.1.1:8001", "10.0.1.2:8001"]

# Edge
cluster:
  sync:
    server_id: "edge-1"
    address: "10.0.2.1:8001"
    seed_servers: ["10.0.1.1:8001", "10.0.1.2:8001"]

Edge nodes can also serve as upstream for other Edges, forming a multi-tier cascade:

Ingest → Origin → Edge-L1 (Regional Hub) → Edge-L2 (City Node) → Viewer

Suitable for large-scale nationwide distribution scenarios.

Cluster state can be inspected and controlled through the management API:

GET /cluster/status        # cluster status overview
GET /cluster/nodes         # list of known nodes
GET /cluster/sessions      # stream session registry
POST /cluster/rebalance    # trigger a rebalance