Cluster Deployment Plugin
The Cluster plugin provides multi-node clustering capabilities for Monibuca, supporting Origin-Edge architecture, automatic node discovery, automatic stream relay, intelligent load balancing, and failover. Inter-node communication is based on the QUIC protocol for low-latency and highly reliable transport.
Enabling the Plugin
Set `enable: true` in `config.yaml` (included in official binaries).
Architecture Overview

    ┌─────────────┐     QUIC     ┌─────────────┐
    │   Origin    │◄────────────►│   Edge-1    │
    │  (node-1)   │              └─────────────┘
    │             │     QUIC     ┌─────────────┐
    │ Push ingest │◄────────────►│   Edge-2    │
    └─────────────┘              └─────────────┘
                        QUIC     ┌─────────────┐
                   ◄────────────►│   Edge-3    │
                                 └─────────────┘

- Origin Node: Receives push streams, serves as the stream source
- Edge Node: Receives pull requests; automatically pulls from the Origin when a stream is not available locally
- Each node can act as both Origin and Edge — the role is determined by the actual location of the stream
Configuration

    cluster:
      sync:
        enabled: true
        server_id: "node-1"             # Unique node identifier
        address: "192.168.1.10:8180"    # HTTP control-plane address (same as global.http.listenaddr)
        seed_servers:                   # Seed list (HTTP addresses, not gRPC :50051)
          - "192.168.1.10:8180"
          - "192.168.1.11:8180"
          - "192.168.1.12:8180"
        heartbeat_interval: 5           # Heartbeat interval (seconds)
        sync_interval: 30               # Full sync interval (seconds)
        heartbeat_ttl: 15               # Heartbeat timeout (seconds)
        suspect_threshold: 2            # Missed heartbeats before marking as suspect
        offline_threshold: 3            # Missed heartbeats before marking as offline

      routing:
        cpu_threshold: 85.0             # CPU usage threshold for triggering redirect (%)
        bandwidth_threshold: 8000.0     # Bandwidth threshold for triggering redirect (Mbps)
        subscriber_threshold: 2000      # Subscriber count threshold for triggering redirect

      allocation:
        ticket_ttl: 30                  # Allocation ticket TTL (seconds)
        candidate_limit: 3              # Number of candidate nodes per allocation

      relay:
        establish_timeout: 10           # Relay connection establishment timeout (seconds)
        release_delay: 10               # Idle relay release delay (seconds)
        max_retry_attempts: 3           # Maximum retry attempts
        retry_delay: 2                  # Retry interval (seconds)
        health_check_interval: 5        # Health check interval (seconds)

      role: []                          # Role configuration (optional)

Key Configuration Details
sync — Node Synchronization
| Field | Description |
|---|---|
| `server_id` | Unique node identifier in the cluster; use meaningful names like `origin-1` or `edge-beijing-1` |
| `address` | This node’s HTTP control-plane address (`host:port`, usually `global.http.listenaddr`), used for `/cluster/api/heartbeat` |
| `seed_servers` | Seed nodes’ HTTP control addresses; do not use `global.tcp.listenaddr` (gRPC, e.g. `:50051`) |
| `heartbeat_interval` | Heartbeat send interval for probing node liveness |
| `suspect_threshold` | After this many consecutive missed heartbeat responses, the node is marked as Suspect |
| `offline_threshold` | After this many consecutive missed heartbeat responses, the node is marked as Offline, triggering session invalidation and relay cleanup |
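
The threshold semantics in the table can be sketched as a small state function. This is an illustration only, not the plugin's implementation: `NodeState` and `classify_node` are hypothetical names, and the defaults simply mirror the sample configuration.

```python
from enum import Enum


class NodeState(Enum):
    ONLINE = "online"
    SUSPECT = "suspect"
    OFFLINE = "offline"


def classify_node(missed_heartbeats: int,
                  suspect_threshold: int = 2,
                  offline_threshold: int = 3) -> NodeState:
    """Map a count of consecutive missed heartbeats to a node state.

    Hypothetical helper: the comparison uses >= because the table says a
    node changes state "after this many" consecutive misses.
    """
    if missed_heartbeats >= offline_threshold:
        return NodeState.OFFLINE
    if missed_heartbeats >= suspect_threshold:
        return NodeState.SUSPECT
    return NodeState.ONLINE


if __name__ == "__main__":
    for missed in range(4):
        print(missed, classify_node(missed).value)
```

With the sample values (a 5-second `heartbeat_interval` and an `offline_threshold` of 3), a silent node would be marked Offline after roughly 15 seconds, assuming one heartbeat is expected per interval.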
routing — Routing Strategy
When a node’s load exceeds the threshold, the Redirect Manager redirects new subscription requests to less loaded nodes:
- CPU threshold: Redirect is triggered when node CPU usage exceeds `cpu_threshold`
- Bandwidth threshold: Redirect is triggered when node egress bandwidth exceeds `bandwidth_threshold`
- Subscriber threshold: Redirect is triggered when the subscriber count on a single node exceeds `subscriber_threshold`
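
A minimal sketch of the three checks, assuming any single exceeded threshold is enough to trigger a redirect. `NodeLoad`, `RoutingThresholds`, and `should_redirect` are hypothetical names for illustration; the real Redirect Manager may weigh these metrics differently.

```python
from dataclasses import dataclass


@dataclass
class NodeLoad:
    cpu_percent: float   # current CPU usage in percent
    egress_mbps: float   # current egress bandwidth in Mbps
    subscribers: int     # current subscriber count on this node


@dataclass
class RoutingThresholds:
    # Defaults copied from the sample `routing` configuration.
    cpu_threshold: float = 85.0
    bandwidth_threshold: float = 8000.0
    subscriber_threshold: int = 2000


def should_redirect(load: NodeLoad, limits: RoutingThresholds) -> bool:
    """Return True when any load metric exceeds its configured threshold."""
    return (
        load.cpu_percent > limits.cpu_threshold
        or load.egress_mbps > limits.bandwidth_threshold
        or load.subscribers > limits.subscriber_threshold
    )
```

For example, `should_redirect(NodeLoad(90.0, 1200.0, 300), RoutingThresholds())` is true because only the CPU check fails.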
relay — Stream Relay
When an Edge node pulls a stream from an Origin node, a connection is established through the Relay mechanism. Configuration options control connection timeout, retry strategy, and health check frequency.
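
The retry options could be applied roughly as follows. This sketch assumes that `max_retry_attempts` counts retries after the first attempt and that failures surface as `ConnectionError`; both are assumptions, and `establish_with_retry` is a hypothetical helper rather than part of the plugin.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def establish_with_retry(connect: Callable[[], T],
                         max_retry_attempts: int = 3,
                         retry_delay: float = 2.0,
                         sleep: Callable[[float], None] = time.sleep) -> T:
    """Call connect(); on ConnectionError retry up to max_retry_attempts times.

    `sleep` is injectable so the delay can be replaced in tests.
    """
    last_error: Optional[Exception] = None
    for attempt in range(max_retry_attempts + 1):
        try:
            return connect()
        except ConnectionError as exc:
            last_error = exc
            if attempt < max_retry_attempts:
                sleep(retry_delay)  # wait before the next attempt
    raise ConnectionError("relay could not be established") from last_error
```

Injecting `sleep` keeps the helper testable without real waiting; in production code the default `time.sleep` applies the configured `retry_delay`.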
Core Components
The Cluster plugin consists of the following managers working together:
| Component | Responsibility |
|---|---|
| SyncManager | Node discovery, heartbeat detection, state synchronization |
| TransportManager | QUIC transport layer management, maintaining inter-node connections |
| SessionRegistry | Stream session registry, tracking which node publishes which stream |
| RelayManager | Stream relay management, establishing and maintaining cross-node stream transport |
| AllocationManager | Resource allocation, selecting the optimal node for new requests |
| RedirectManager | Load balancing redirect decisions |
| ClusterHooks | Event hooks, responding to stream publish/unpublish/subscribe/unsubscribe events |
| Cluster HTTP API | Control-plane REST (/cluster/api/*): heartbeat, nodes, sessions |
Workflow
Stream Publishing
- A publisher pushes a stream to the Origin node
- ClusterHooks listens for `StreamEvent::Created` and registers stream information in the SessionRegistry
- SyncManager synchronizes stream information to other cluster nodes via heartbeats
Stream Subscription (Edge Pull-from-Origin)
- A pull request arrives at an Edge node
- The Edge does not find the stream locally and queries the SessionRegistry to determine the Origin node
- RelayManager establishes a Relay connection to the Origin via QUIC
- Stream data from the Origin is transmitted to the Edge via Relay, and the Edge distributes it to clients
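
The lookup in the steps above can be sketched with a toy registry. The class below only borrows the SessionRegistry name for readability; its methods, and the `resolve_source` helper, are hypothetical and do not reflect the plugin's actual interface.

```python
from typing import Dict, Optional, Set, Tuple


class SessionRegistry:
    """Toy registry mapping stream_path to the server_id that publishes it."""

    def __init__(self) -> None:
        self._origin_by_stream: Dict[str, str] = {}

    def register(self, stream_path: str, server_id: str) -> None:
        self._origin_by_stream[stream_path] = server_id

    def origin_of(self, stream_path: str) -> Optional[str]:
        return self._origin_by_stream.get(stream_path)


def resolve_source(stream_path: str,
                   local_id: str,
                   local_streams: Set[str],
                   registry: SessionRegistry) -> Tuple[str, Optional[str]]:
    """Decide how an Edge serves a pull: locally, via relay, or not at all."""
    if stream_path in local_streams:
        return ("local", local_id)
    origin = registry.origin_of(stream_path)
    if origin is None:
        return ("not_found", None)
    return ("relay", origin)  # a relay to this node would be established next
```

A `("relay", "origin-1")` result corresponds to the step where RelayManager opens the QUIC relay to the Origin.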
Node Failure
- SyncManager detects a node’s heartbeat timeout
- After reaching `suspect_threshold`, the node is marked as suspect; after reaching `offline_threshold`, it is marked as offline
- SessionRegistry invalidates all stream sessions for that node
- RelayManager cleans up all Relay connections for that node
- When clients make new requests, AllocationManager assigns an available node
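
The invalidation and cleanup steps might look roughly like this. The data shapes are assumptions made for the sketch (plain dictionaries keyed by stream path and relay ID), and `handle_node_offline` is a hypothetical name.

```python
from typing import Dict, List, Tuple


def handle_node_offline(offline_id: str,
                        sessions: Dict[str, str],
                        relays: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Drop sessions and relays tied to offline_id and report what was removed.

    sessions: stream_path -> server_id of the publishing node (assumed shape)
    relays:   relay_id    -> server_id of the relay's remote node (assumed shape)
    """
    dead_streams = sorted(p for p, node in sessions.items() if node == offline_id)
    for path in dead_streams:
        del sessions[path]
    dead_relays = sorted(r for r, node in relays.items() if node == offline_id)
    for relay_id in dead_relays:
        del relays[relay_id]
    return dead_streams, dead_relays
```

Collecting the keys before deleting avoids mutating a dictionary while iterating over it.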
Deployment Examples
Minimal Cluster (1 Origin + 2 Edge)
Origin node configuration (node-1):

    cluster:
      sync:
        enabled: true
        server_id: "origin-1"
        address: "10.0.0.1:8180"
        seed_servers: ["10.0.0.1:8180"]

Edge node configuration (node-2):

    cluster:
      sync:
        enabled: true
        server_id: "edge-1"
        address: "10.0.0.2:8180"
        seed_servers: ["10.0.0.1:8180"]

Edge node configuration (node-3):

    cluster:
      sync:
        enabled: true
        server_id: "edge-2"
        address: "10.0.0.3:8180"
        seed_servers: ["10.0.0.1:8180"]

For more detailed cluster architecture design, refer to Cluster Architecture.
HTTP API (cluster management)
Default base URL: `http://localhost:8180`, route prefix: `/cluster/api/`
Standard success response:

    {
      "code": 0,
      "message": "success",
      "data": {}
    }

1) Local node status

    GET /cluster/api/status/local

Response (example):

    {
      "code": 0,
      "message": "success",
      "data": {
        "server_id": "node-1",
        "address": "192.168.1.10:8180",
        "has_sync": true,
        "node_count": 3,
        "session_count": 25,
        "relay_session_count": 2,
        "active_relay_sessions": 1
      }
    }

Optional `sync.internal_api_key`: when set, all `/cluster/api/*` requests must include the header `X-Cluster-Api-Key` (or `Authorization: Bearer <key>`).
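
As a hedged client-side sketch, the request below targets the status endpoint and attaches the API key header when one is supplied. It uses only the Python standard library; `build_status_request` is a hypothetical helper, and the request is built but not sent.

```python
import urllib.request
from typing import Optional


def build_status_request(base_url: str = "http://localhost:8180",
                         api_key: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a GET request for the local node status."""
    request = urllib.request.Request(
        f"{base_url}/cluster/api/status/local", method="GET"
    )
    if api_key:
        # Needed only when sync.internal_api_key is configured on the node.
        # urllib stores the name as "X-cluster-api-key"; HTTP header names
        # are case-insensitive, so the server sees the same header.
        request.add_header("X-Cluster-Api-Key", api_key)
    return request


# To actually send it:
#   with urllib.request.urlopen(build_status_request(api_key="...")) as resp:
#       body = resp.read()
```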
2) Server list

    GET /cluster/api/servers

Returns all visible node summaries.
3) Heartbeat report

    POST /cluster/api/heartbeat
    Content-Type: application/json

Request body:

    {
      "summary": {
        "server_id": "node-2",
        "address": "192.168.1.11:8180"
      }
    }

| Field | Type | Required | Description |
|---|---|---|---|
| `summary.server_id` | string | Yes | Node ID |
| `summary.address` | string | Yes | Node address |

Response data (v6):

    {
      "summaries": [
        {
          "server_id": "origin-1",
          "address": "192.168.1.10:8180",
          "quic_addr": "..."
        }
      ]
    }

Edge nodes should:
- merge summaries and sessions from the response;
- periodically `GET /cluster/api/servers` and `GET /cluster/api/sessions` from seeds on `sync_interval`;
- push session updates via `POST /cluster/api/sessions/sync` after publish/unpublish;
- push subscribe/unsubscribe deltas via `POST /cluster/api/sessions/participation` (`action`: `subscribe` or `unsubscribe`).
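
Merging the returned summaries into a local node catalog could be done along these lines. The catalog shape (a dictionary keyed by `server_id`) and the `merge_summaries` helper are assumptions for illustration, not the plugin's data model.

```python
from typing import Dict, Iterable


def merge_summaries(catalog: Dict[str, dict],
                    summaries: Iterable[dict]) -> Dict[str, dict]:
    """Upsert node summaries into a local catalog keyed by server_id.

    Fields from a newer summary overwrite older ones; fields the newer
    summary omits are kept.
    """
    for summary in summaries:
        server_id = summary.get("server_id")
        if not server_id:
            continue  # skip malformed entries rather than failing the merge
        catalog[server_id] = {**catalog.get(server_id, {}), **summary}
    return catalog
```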
4) Full server sync

    POST /cluster/api/servers/sync
    Content-Type: application/json

Request body:

    {
      "summaries": [
        { "server_id": "node-1", "address": "192.168.1.10:8180" },
        { "server_id": "node-2", "address": "192.168.1.11:8180" }
      ]
    }

5) Session list

    GET /cluster/api/sessions

Returns cluster stream sessions (sorted by `stream_path`).
6) Subscribe participation sync

    POST /cluster/api/sessions/participation
    Content-Type: application/json

    {
      "stream_path": "live/test",
      "node_id": "edge-1",
      "action": "subscribe"
    }

`action` is `subscribe` or `unsubscribe`. Remote `node_id` values update `subscribe_nodes` on the receiver. When `node_id` equals the local `server_id` and the stream is published on a remote origin, the receiver starts (or reuses) a relay session via `attach_or_ensure_relay_session`.
Participation events are fanned out to all known peer HTTP addresses from the node catalog (not only seed_servers).
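
The receiver-side bookkeeping for remote `node_id` values might be sketched as below. This covers only the `subscribe_nodes` update, not relay startup; the dictionary-of-sets shape and the `apply_participation` name are assumptions.

```python
from typing import Dict, Set


def apply_participation(subscribe_nodes: Dict[str, Set[str]],
                        event: dict) -> Dict[str, Set[str]]:
    """Apply one participation event to a stream_path -> {node_id} map."""
    action = event["action"]
    if action not in ("subscribe", "unsubscribe"):
        raise ValueError(f"unsupported action: {action}")
    nodes = subscribe_nodes.setdefault(event["stream_path"], set())
    if action == "subscribe":
        nodes.add(event["node_id"])
    else:
        nodes.discard(event["node_id"])  # no error if the node was not listed
    return subscribe_nodes
```

Using a set makes repeated `subscribe` events from the same node idempotent, which is convenient when events are fanned out to every peer.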
7) Edge relay pull (Origin)

    POST /cluster/api/relay/pull
    Content-Type: application/json

Request body (Edge → Origin):

    {
      "request_id": "uuid",
      "stream_path": "live/test",
      "requestor_id": "edge-1",
      "quic_addr": "192.168.1.11:44944",
      "session_id": "uuid",
      "timestamp": 1710000000000
    }

Response data is `PullStreamResponse`: when `accepted` is `true`, Origin accepts the pull and starts QUIC relay; `quic_addr` is Origin’s QUIC endpoint. Origin also sends FLV sequence headers as a metadata frame before media.
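
An Edge-side helper for assembling this request body could look like the sketch below. It assumes `request_id` and `session_id` are UUID strings and that `timestamp` is Unix time in milliseconds, which the example value suggests but the text does not state; `build_pull_request` is a hypothetical name.

```python
import json
import time
import uuid


def build_pull_request(stream_path: str, requestor_id: str, quic_addr: str) -> dict:
    """Assemble a relay pull request body for POST /cluster/api/relay/pull."""
    return {
        "request_id": str(uuid.uuid4()),
        "stream_path": stream_path,
        "requestor_id": requestor_id,
        "quic_addr": quic_addr,                 # the Edge's own QUIC endpoint
        "session_id": str(uuid.uuid4()),
        "timestamp": int(time.time() * 1000),   # assumed: epoch milliseconds
    }


if __name__ == "__main__":
    body = build_pull_request("live/test", "edge-1", "192.168.1.11:44944")
    print(json.dumps(body, indent=2))
```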
8) Local relay session list

    GET /cluster/api/relay/sessions

Returns relay sessions managed on this node (`state`, `subscriber_count`, `frame_count`, `source_quic_addr`, etc.) for the Admin cluster Relay tab.
9) Publish allocation decision

    POST /cluster/api/allocate/publish
    Content-Type: application/json

10) Play allocation decision

    POST /cluster/api/allocate/play
    Content-Type: application/json

Both endpoints use the same request body:

    {
      "stream_path": "live/test",
      "protocol": "webrtc",
      "limit": 3
    }

| Field | Type | Required | Description |
|---|---|---|---|
| `stream_path` | string | Yes | Stream path |
| `protocol` | string | No | Protocol name, defaults to `unknown` |
| `limit` | int | No | Candidate node limit |
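
A small sketch for building the shared request body, leaving optional fields out so the server-side defaults apply. `allocation_body` is a hypothetical helper, not part of any client library for this API.

```python
import json
from typing import Optional


def allocation_body(stream_path: str,
                    protocol: Optional[str] = None,
                    limit: Optional[int] = None) -> bytes:
    """Encode the JSON body shared by the publish and play allocation endpoints."""
    if not stream_path:
        raise ValueError("stream_path is required")
    body = {"stream_path": stream_path}
    if protocol is not None:
        body["protocol"] = protocol  # omitted: the server default applies
    if limit is not None:
        body["limit"] = limit
    return json.dumps(body).encode("utf-8")
```

The resulting bytes can be sent to either `/cluster/api/allocate/publish` or `/cluster/api/allocate/play` with `Content-Type: application/json`.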
11) Redirect decision

    GET /cluster/api/redirect/decision?stream_path=live/test&protocol=webrtc

| Parameter | Type | Required | Description |
|---|---|---|---|
| `stream_path` | string | Yes | Target stream path |
| `protocol` | string | No | Protocol type |
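
The query string can be built with the standard library as sketched here. `redirect_decision_url` is a hypothetical helper; `safe="/"` keeps the slash in the stream path unescaped so the result matches the example URL above.

```python
from typing import Optional
from urllib.parse import urlencode


def redirect_decision_url(stream_path: str,
                          protocol: Optional[str] = None,
                          base_url: str = "http://localhost:8180") -> str:
    """Build the GET URL for the redirect decision endpoint."""
    params = {"stream_path": stream_path}
    if protocol:
        params["protocol"] = protocol
    query = urlencode(params, safe="/")
    return f"{base_url}/cluster/api/redirect/decision?{query}"
```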
Error responses
| HTTP Status | Scenario |
|---|---|
| 400 | Invalid body or missing `stream_path` |
| 401 | Missing or invalid `X-Cluster-Api-Key` when `internal_api_key` is configured |
| 404 | Route not found |
| 503 | Allocation/redirect manager unavailable |
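
A client might translate these statuses into readable messages as in the sketch below. It assumes a successful call arrives as HTTP 200 with `code` equal to 0 in the body, which the standard success response suggests but the text does not spell out; `explain_response` is a hypothetical name.

```python
from typing import Optional

# Messages taken from the error table above.
ERROR_HINTS = {
    400: "invalid body or missing stream_path",
    401: "missing or invalid X-Cluster-Api-Key",
    404: "route not found",
    503: "allocation/redirect manager unavailable",
}


def explain_response(status: int, payload: Optional[dict] = None) -> str:
    """Summarize a cluster API response for logs or error messages."""
    if status == 200 and payload is not None and payload.get("code") == 0:
        return "ok"
    hint = ERROR_HINTS.get(status, "unexpected response")
    return f"HTTP {status}: {hint}"
```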