# Performance Tuning
This document describes the key performance parameters of Monibuca V6 and how to tune them, helping you achieve optimal performance across different deployment scales.
## Dispatcher Workers Tuning
`dispatcher_workers` controls the number of stream dispatch threads and is the most critical parameter affecting throughput and latency.
```yaml
dispatcher_workers: 0 # 0 = auto (CPU core count)
```

### Operating Modes
| Value | Behavior | Use Case |
|---|---|---|
| 0 | Auto-detect CPU core count and create an equal number of threads | Recommended default, suitable for most scenarios |
| 1 | Single-threaded dispatch | Low concurrency, latency-first scenarios |
| N | Specify N dispatch threads | Precise resource allocation control |
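Conceptually, the dispatchers form a fixed-size worker pool pulling frame-dispatch jobs from a shared queue. The sketch below only illustrates that model and is not Monibuca's actual internals (`dispatchJob` and `startDispatchers` are invented names); it shows how a configured value of 0 resolves to the CPU core count.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// dispatchJob stands in for "deliver one frame to its subscribers".
// It is a placeholder for illustration, not a Monibuca type.
type dispatchJob func()

// startDispatchers launches the worker pool. A configured value of 0
// resolves to the number of CPU cores, mirroring the documented
// "auto" behavior of dispatcher_workers.
func startDispatchers(workers int, jobs <-chan dispatchJob) *sync.WaitGroup {
	if workers <= 0 {
		workers = runtime.NumCPU()
	}
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for job := range jobs {
				job() // dispatch one frame
			}
		}()
	}
	return &wg
}

func main() {
	jobs := make(chan dispatchJob, 1024)
	wg := startDispatchers(0, jobs) // 0 = one worker per CPU core

	for i := 0; i < 4; i++ {
		n := i
		jobs <- func() { fmt.Println("dispatched frame", n) }
	}
	close(jobs)
	wg.Wait()
}
```

More workers raise parallel throughput at the cost of extra thread switching, which is the trade-off the tuning recommendations below balance.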
### Tuning Recommendations
- Small scale (< 100 streams): `dispatcher_workers: 0` is sufficient
- Medium scale (100-1000 streams): Set to 50%-75% of the CPU core count, leaving headroom for other tasks
- Large scale (> 1000 streams): Set to the CPU core count, ensuring dispatch doesn't become a bottleneck
- Ultra-low latency: Set to `1` to reduce thread-switching overhead (at the cost of throughput)
```yaml
# 8-core CPU, 1000 streams scenario
dispatcher_workers: 6
```

## RingBuffer
RingBuffer is the core data structure for Monibuca's zero-copy distribution. Each Track of each stream has its own independent RingBuffer.
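To make the per-Track RingBuffer concrete, here is a minimal illustrative ring of frame slots with a single write cursor that overwrites the oldest slot once the ring is full; subscribers would typically keep their own read positions into the same slots rather than receiving copies, which is what makes distribution zero-copy. The types and methods below are invented for this sketch and are not Monibuca's API.

```go
package main

import "fmt"

// frame is a placeholder for one encoded audio/video frame.
type frame struct {
	seq  uint64
	data []byte
}

// ringBuffer is a fixed-capacity ring: the writer advances one cursor
// and overwrites the oldest slot once the ring wraps around.
type ringBuffer struct {
	slots []frame
	next  uint64 // total frames written; next % len(slots) is the write slot
}

func newRingBuffer(size int) *ringBuffer {
	return &ringBuffer{slots: make([]frame, size)}
}

func (r *ringBuffer) write(f frame) {
	r.slots[r.next%uint64(len(r.slots))] = f
	r.next++
}

// at returns the frame with the given sequence number if it is still
// resident, i.e. not yet overwritten by newer frames.
func (r *ringBuffer) at(seq uint64) (frame, bool) {
	size := uint64(len(r.slots))
	if seq >= r.next || r.next-seq > size {
		return frame{}, false
	}
	return r.slots[seq%size], true
}

func main() {
	rb := newRingBuffer(4) // ring_size: 4 for demonstration
	for i := uint64(0); i < 6; i++ {
		rb.write(frame{seq: i})
	}
	_, old := rb.at(0)   // already overwritten
	f, fresh := rb.at(5) // still resident
	fmt.Println(old, fresh, f.seq) // false true 5
}
```

A larger ring keeps more history per Track: slow subscribers can fall further behind before their frames are overwritten, which is the fault-tolerance side of the capacity table below.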
```yaml
# Global default configuration
publish:
  ring_size: 256 # RingBuffer capacity (in frames)
```

### Capacity Selection
| Capacity | Memory Usage (est./Track) | Use Case |
|---|---|---|
| 64 | ~2 MB | Low-latency live streaming, tight memory |
| 128 | ~4 MB | General live streaming scenarios |
| 256 | ~8 MB | Recommended default, balancing latency and fault tolerance |
| 512 | ~16 MB | High bitrate streams, multi-subscriber scenarios |
| 1024 | ~32 MB | Ultra-high concurrency subscriptions |
### Memory Impact
Total memory estimation formula:
```
Memory ≈ Stream Count × Tracks per Stream × ring_size × Average Frame Size
```

For example: 100 streams × 2 Tracks × 256 frames × 16 KB ≈ 800 MB.
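To plug in your own numbers, a back-of-envelope helper based on the formula above might look like this (the function name and sample figures are illustrative only; estimate your average frame size from bitrate and frame rate):

```go
package main

import "fmt"

// estimateRingMemoryMB applies the formula above:
// streams × tracks per stream × ring_size × average frame size.
func estimateRingMemoryMB(streams, tracksPerStream, ringSize, avgFrameKB int) float64 {
	totalKB := streams * tracksPerStream * ringSize * avgFrameKB
	return float64(totalKB) / 1024
}

func main() {
	// The example from the text: 100 streams × 2 Tracks × 256 frames × 16 KB.
	fmt.Printf("~%.0f MB\n", estimateRingMemoryMB(100, 2, 256, 16)) // ~800 MB
}
```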
## Bounded Channel Queue
Subscribers and dispatchers communicate through bounded channels.
```yaml
subscribe:
  channel_size: 64 # Channel capacity per subscriber
```

### Tuning Recommendations
| Value | Description |
|---|---|
| 16 | Lowest latency, but prone to frame drops during network fluctuations |
| 32 | Lower latency, moderate buffering |
| 64 | Recommended default |
| 128 | High buffering, suitable for unstable network conditions |
Behavior when the channel is full depends on the configured strategy:
- Drop old frames: Maintains real-time performance, may cause visual jumps
- Block and wait: Maintains completeness, may increase latency
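The two strategies correspond to different send operations on a bounded channel. The sketch below is a generic Go illustration of the trade-off, not Monibuca's actual subscriber code: a non-blocking send that evicts the oldest queued frame keeps the subscriber current at the cost of gaps, while a plain blocking send preserves every frame but lets latency build up.

```go
package main

import "fmt"

// sendDropOldest never blocks: if the subscriber's channel is full,
// the oldest queued frame is evicted to make room for the new one.
// Real-time, but the viewer may see a visual jump.
func sendDropOldest(ch chan int, f int) {
	for {
		select {
		case ch <- f:
			return
		default:
			select {
			case <-ch: // drop the oldest queued frame
			default:
			}
		}
	}
}

// sendBlocking preserves every frame at the cost of back-pressure:
// the dispatcher waits until the subscriber drains the channel.
func sendBlocking(ch chan int, f int) {
	ch <- f
}

func main() {
	ch := make(chan int, 2) // a tiny channel_size for demonstration
	for f := 1; f <= 4; f++ {
		sendDropOldest(ch, f)
	}
	close(ch)
	for f := range ch {
		fmt.Println("received frame", f) // frames 3 and 4; 1 and 2 were dropped
	}
}
```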
## System-Level Tuning
### File Descriptors
Each network connection consumes a file descriptor. Large-scale scenarios require raising system limits.
```bash
# Check current limit
ulimit -n
```

```bash
# Temporary modification
ulimit -n 1000000
```

```
# Permanent modification in /etc/security/limits.conf
* soft nofile 1000000
* hard nofile 1000000
```

Estimation formula:
```
Required fd ≈ (Publishers + Subscribers) × 1.2 + Base Overhead (~1000)
```

### TCP Parameter Tuning
Section titled “TCP Parameter Tuning”# Increase TCP buffer sizesnet.core.rmem_max = 16777216net.core.wmem_max = 16777216net.ipv4.tcp_rmem = 4096 87380 16777216net.ipv4.tcp_wmem = 4096 87380 16777216
# Increase connection queuenet.core.somaxconn = 65535net.ipv4.tcp_max_syn_backlog = 65535
# Enable TCP fast recyclingnet.ipv4.tcp_tw_reuse = 1net.ipv4.tcp_fin_timeout = 15
# Increase port rangenet.ipv4.ip_local_port_range = 1024 65535
# Increase network device receive queuenet.core.netdev_max_backlog = 65535Apply the configuration:
```bash
sudo sysctl -p
```

### CPU Affinity
For multi-core servers with NUMA architecture, binding the process to specific CPUs can reduce cross-NUMA-node memory access:
```bash
# Bind to CPU 0-7
taskset -c 0-7 ./monibuca
```

### Memory Allocator
Monibuca uses the system allocator by default. In high-concurrency scenarios, you can use jemalloc via `LD_PRELOAD` for better multi-threaded performance:
```bash
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so ./monibuca
```

## Performance Benchmark Reference
The following data is based on a test environment (Intel Xeon 8-core, 32 GB RAM, 10 GbE network):
### Throughput
| Scenario | Streams | Total Subscribers | CPU Usage | Memory Usage | Latency (P99) |
|---|---|---|---|---|---|
| RTMP→FLV | 100 | 1,000 | 15% | 800 MB | < 500ms |
| RTMP→FLV | 500 | 5,000 | 45% | 3.5 GB | < 800ms |
| RTMP→HLS | 100 | 10,000 | 25% | 1.2 GB | 3-6s |
| RTMP→WebRTC | 50 | 500 | 30% | 1.0 GB | < 200ms |
| Mixed Protocols | 200 | 2,000 | 35% | 2.0 GB | Protocol-dependent |
### Single Stream Stress Test
| Metric | Value |
|---|---|
| Max subscribers per stream | 10,000+ |
| Max bitrate per stream | 50 Mbps |
| Max stream count | 10,000+ |
| Engine startup time | < 1s |
## Monitoring and Diagnostics
### Real-Time Metrics
```
GET /api/sysinfo
```

Returns real-time metrics including CPU, memory, and network I/O. Combined with the report plugin, metrics can be exported to Prometheus/Grafana.
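As a quick example of pulling these metrics into your own tooling, the sketch below polls the endpoint every few seconds and prints the raw response. The listen address is an assumption for this example, and the response schema is not documented here, so the body is treated as opaque JSON.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	// Assumes the HTTP API listens on localhost:8080; adjust for your deployment.
	const url = "http://localhost:8080/api/sysinfo"

	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()

	for range ticker.C {
		resp, err := http.Get(url)
		if err != nil {
			fmt.Println("sysinfo request failed:", err)
			continue
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()
		fmt.Printf("%s\n", body) // raw metrics JSON; parse or forward as needed
	}
}
```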
### Debug Plugin
Enable the debug plugin for more detailed internal state:
```
features = ["debug"]
```

```
GET /debug/ringbuffer/{stream_path}
GET /debug/dispatcher
GET /debug/connections
```

### Log Level
Dynamically adjust the log level at runtime:
```
PUT /config/global
Content-Type: application/json

{
  "log": {
    "level": "debug"
  }
}
```