# Performance Tuning
This document describes the key performance parameters of Monibuca V6 and how to tune them, helping you achieve optimal performance across different deployment scales.
## Dispatcher Workers Tuning
`dispatcher_workers` controls the number of stream dispatch threads and is the most critical parameter affecting throughput and latency.
```yaml
dispatcher_workers: 0 # 0 = auto (CPU core count)
```

### Operating Modes
| Value | Behavior | Use Case |
|---|---|---|
| 0 | Auto-detect CPU core count and create an equal number of threads | Recommended default, suitable for most scenarios |
| 1 | Single-threaded dispatch | Low concurrency, latency-first scenarios |
| N | Specify N dispatch threads | Precise resource allocation control |
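Conceptually, the dispatchers form a fixed-size worker pool pulling frame-dispatch jobs from a shared queue. The sketch below only illustrates that model and is not Monibuca's actual internals (`dispatchJob` and `startDispatchers` are invented names); it shows how a configured value of 0 resolves to the CPU core count.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// dispatchJob stands in for "deliver one frame to its subscribers".
// It is a placeholder for illustration, not a Monibuca type.
type dispatchJob func()

// startDispatchers launches the worker pool. A configured value of 0
// resolves to the number of CPU cores, mirroring the documented
// "auto" behavior of dispatcher_workers.
func startDispatchers(workers int, jobs <-chan dispatchJob) *sync.WaitGroup {
	if workers <= 0 {
		workers = runtime.NumCPU()
	}
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for job := range jobs {
				job() // dispatch one frame
			}
		}()
	}
	return &wg
}

func main() {
	jobs := make(chan dispatchJob, 1024)
	wg := startDispatchers(0, jobs) // 0 = one worker per CPU core

	for i := 0; i < 4; i++ {
		n := i
		jobs <- func() { fmt.Println("dispatched frame", n) }
	}
	close(jobs)
	wg.Wait()
}
```

More workers raise parallel throughput at the cost of extra thread switching, which is the trade-off the tuning recommendations below balance.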
### Tuning Recommendations
- Small scale (< 100 streams): `dispatcher_workers: 0` is sufficient
- Medium scale (100-1000 streams): Set to 50%-75% of the CPU core count, leaving headroom for other tasks
- Large scale (> 1000 streams): Set to the CPU core count, ensuring dispatch doesn't become a bottleneck
- Ultra-low latency: Set to `1` to reduce thread-switching overhead (at the cost of throughput)
```yaml
# 8-core CPU, 1000 streams scenario
dispatcher_workers: 6
```

## RingBuffer
RingBuffer is the core data structure for Monibuca's zero-copy distribution. Each Track of each stream has its own independent RingBuffer.
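To make the per-Track RingBuffer concrete, here is a minimal illustrative ring of frame slots with a single write cursor that overwrites the oldest slot once the ring is full; subscribers would typically keep their own read positions into the same slots rather than receiving copies, which is what makes distribution zero-copy. The types and methods below are invented for this sketch and are not Monibuca's API.

```go
package main

import "fmt"

// frame is a placeholder for one encoded audio/video frame.
type frame struct {
	seq  uint64
	data []byte
}

// ringBuffer is a fixed-capacity ring: the writer advances one cursor
// and overwrites the oldest slot once the ring wraps around.
type ringBuffer struct {
	slots []frame
	next  uint64 // total frames written; next % len(slots) is the write slot
}

func newRingBuffer(size int) *ringBuffer {
	return &ringBuffer{slots: make([]frame, size)}
}

func (r *ringBuffer) write(f frame) {
	r.slots[r.next%uint64(len(r.slots))] = f
	r.next++
}

// at returns the frame with the given sequence number if it is still
// resident, i.e. not yet overwritten by newer frames.
func (r *ringBuffer) at(seq uint64) (frame, bool) {
	size := uint64(len(r.slots))
	if seq >= r.next || r.next-seq > size {
		return frame{}, false
	}
	return r.slots[seq%size], true
}

func main() {
	rb := newRingBuffer(4) // ring_size: 4 for demonstration
	for i := uint64(0); i < 6; i++ {
		rb.write(frame{seq: i})
	}
	_, old := rb.at(0)   // already overwritten
	f, fresh := rb.at(5) // still resident
	fmt.Println(old, fresh, f.seq) // false true 5
}
```

A larger ring keeps more history per Track: slow subscribers can fall further behind before their frames are overwritten, which is the fault-tolerance side of the capacity table below.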
```yaml
# Global default configuration
publish:
  ring_size: 256 # RingBuffer capacity (in frames)
```

### Capacity Selection
| Capacity | Memory Usage (est./Track) | Use Case |
|---|---|---|
| 64 | ~2 MB | Low-latency live streaming, tight memory |
| 128 | ~4 MB | General live streaming scenarios |
| 256 | ~8 MB | Recommended default, balancing latency and fault tolerance |
| 512 | ~16 MB | High bitrate streams, multi-subscriber scenarios |
| 1024 | ~32 MB | Ultra-high concurrency subscriptions |
### Memory Impact
Total memory estimation formula:
```
Memory ≈ Stream Count × Tracks per Stream × ring_size × Average Frame Size
```

For example: 100 streams × 2 Tracks × 256 frames × 16 KB ≈ 800 MB.
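To plug in your own numbers, a back-of-envelope helper based on the formula above might look like this (the function name and sample figures are illustrative only; estimate your average frame size from bitrate and frame rate):

```go
package main

import "fmt"

// estimateRingMemoryMB applies the formula above:
// streams × tracks per stream × ring_size × average frame size.
func estimateRingMemoryMB(streams, tracksPerStream, ringSize, avgFrameKB int) float64 {
	totalKB := streams * tracksPerStream * ringSize * avgFrameKB
	return float64(totalKB) / 1024
}

func main() {
	// The example from the text: 100 streams × 2 Tracks × 256 frames × 16 KB.
	fmt.Printf("~%.0f MB\n", estimateRingMemoryMB(100, 2, 256, 16)) // ~800 MB
}
```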
## Bounded Channel Queue
Subscribers and dispatchers communicate through bounded channels.
```yaml
subscribe:
  channel_size: 64 # Channel capacity per subscriber
```

### Tuning Recommendations
| Value | Description |
|---|---|
| 16 | Lowest latency, but prone to frame drops during network fluctuations |
| 32 | Lower latency, moderate buffering |
| 64 | Recommended default |
| 128 | High buffering, suitable for unstable network conditions |
Behavior when the channel is full depends on the configured strategy:
- Drop old frames: Maintains real-time performance, may cause visual jumps
- Block and wait: Maintains completeness, may increase latency
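The two strategies correspond to different send operations on a bounded channel. The sketch below is a generic Go illustration of the trade-off, not Monibuca's actual subscriber code: a non-blocking send that evicts the oldest queued frame keeps the subscriber current at the cost of gaps, while a plain blocking send preserves every frame but lets latency build up.

```go
package main

import "fmt"

// sendDropOldest never blocks: if the subscriber's channel is full,
// the oldest queued frame is evicted to make room for the new one.
// Real-time, but the viewer may see a visual jump.
func sendDropOldest(ch chan int, f int) {
	for {
		select {
		case ch <- f:
			return
		default:
			select {
			case <-ch: // drop the oldest queued frame
			default:
			}
		}
	}
}

// sendBlocking preserves every frame at the cost of back-pressure:
// the dispatcher waits until the subscriber drains the channel.
func sendBlocking(ch chan int, f int) {
	ch <- f
}

func main() {
	ch := make(chan int, 2) // a tiny channel_size for demonstration
	for f := 1; f <= 4; f++ {
		sendDropOldest(ch, f)
	}
	close(ch)
	for f := range ch {
		fmt.Println("received frame", f) // frames 3 and 4; 1 and 2 were dropped
	}
}
```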
## System-Level Tuning
### File Descriptors
Each network connection consumes a file descriptor. Large-scale scenarios require raising system limits.
```bash
# Check current limit
ulimit -n
```

```bash
# Temporary modification
ulimit -n 1000000
```

```
# Permanent modification in /etc/security/limits.conf
* soft nofile 1000000
* hard nofile 1000000
```

Estimation formula:
```
Required fd ≈ (Publishers + Subscribers) × 1.2 + Base Overhead (~1000)
```

### TCP Parameter Tuning
Section titled “TCP Parameter Tuning”# Increase TCP buffer sizesnet.core.rmem_max = 16777216net.core.wmem_max = 16777216net.ipv4.tcp_rmem = 4096 87380 16777216net.ipv4.tcp_wmem = 4096 87380 16777216
# Increase connection queuenet.core.somaxconn = 65535net.ipv4.tcp_max_syn_backlog = 65535
# Enable TCP fast recyclingnet.ipv4.tcp_tw_reuse = 1net.ipv4.tcp_fin_timeout = 15
# Increase port rangenet.ipv4.ip_local_port_range = 1024 65535
# Increase network device receive queuenet.core.netdev_max_backlog = 65535Apply the configuration:
```bash
sudo sysctl -p
```

### CPU Affinity
For multi-core servers with NUMA architecture, binding the process to specific CPUs can reduce cross-NUMA-node memory access:
```bash
# Bind to CPU 0-7
taskset -c 0-7 ./monibuca
```

### Memory Allocator
Monibuca uses the system allocator by default. In high-concurrency scenarios, you can use jemalloc via `LD_PRELOAD` for better multi-threaded performance:
```bash
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so ./monibuca
```

## Performance Benchmark Reference
The following data is based on a test environment (Intel Xeon 8-core, 32 GB RAM, 10 GbE network):
### Throughput
| Scenario | Streams | Total Subscribers | CPU Usage | Memory Usage | Latency (P99) |
|---|---|---|---|---|---|
| RTMP→FLV | 100 | 1,000 | 15% | 800 MB | < 500ms |
| RTMP→FLV | 500 | 5,000 | 45% | 3.5 GB | < 800ms |
| RTMP→HLS | 100 | 10,000 | 25% | 1.2 GB | 3-6s |
| RTMP→WebRTC | 50 | 500 | 30% | 1.0 GB | < 200ms |
| Mixed Protocols | 200 | 2,000 | 35% | 2.0 GB | Protocol-dependent |
### Single Stream Stress Test
| Metric | Value |
|---|---|
| Max subscribers per stream | 10,000+ |
| Max bitrate per stream | 50 Mbps |
| Max stream count | 10,000+ |
| Engine startup time | < 1s |
## Monitoring and Diagnostics
### Real-Time Metrics
```
GET /api/sysinfo
```

Returns real-time metrics including CPU, memory, and network I/O. Combined with the report plugin, metrics can be exported to Prometheus/Grafana.
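As a quick example of pulling these metrics into your own tooling, the sketch below polls the endpoint every few seconds and prints the raw response. The listen address is an assumption for this example, and the response schema is not documented here, so the body is treated as opaque JSON.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	// Assumes the HTTP API listens on localhost:8080; adjust for your deployment.
	const url = "http://localhost:8080/api/sysinfo"

	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()

	for range ticker.C {
		resp, err := http.Get(url)
		if err != nil {
			fmt.Println("sysinfo request failed:", err)
			continue
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()
		fmt.Printf("%s\n", body) // raw metrics JSON; parse or forward as needed
	}
}
```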
### Debug Plugin
Enable the debug plugin for more detailed internal state:
```
features = ["debug"]
```

```
GET /debug/ringbuffer/{stream_path}
GET /debug/dispatcher
GET /debug/connections
```

### Log Level
Dynamically adjust the log level at runtime:
```
PUT /config/global
Content-Type: application/json

{
  "log": {
    "level": "debug"
  }
}
```