
System Design Behind Multi-Conference Video Calls
Dive deep into the architectural patterns powering popular video conferencing apps like Google Meet and Zoom. Learn about WebRTC, Peer-to-Peer (P2P), Mesh P2P, Multi-point Control Unit (MCU), and the highly scalable Selective Forwarding Unit (SFU) architectures.
Introduction
Video conferencing has become an integral part of our daily lives, from team meetings on Google Meet to family calls on Zoom. But have you ever wondered about the complex system design that makes these real-time, multi-party interactions possible? Let's dive deep into the architectural patterns behind these applications, exploring WebRTC, Peer-to-Peer (P2P), Multi-point Control Unit (MCU), and the Selective Forwarding Unit (SFU).
What are Multi-Conference Video Applications?
Simply put, a multi-conference video application allows more than two people to engage in a real-time video and audio call. Think of platforms like:
- Zoom
- Google Meet
- Microsoft Teams
- Discord
Designing such systems is no trivial task, as they demand high performance for real-time audio and video transmission and robust architectural patterns to handle scalability.
The Foundation: WebRTC and Peer-to-Peer (P2P)
At the heart of many modern real-time communication systems is WebRTC (Web Real-Time Communications). WebRTC enables direct Peer-to-Peer (P2P) communication, allowing two users (or "peers") to send and receive audio and video streams directly without an intermediary server.
Key Characteristics of P2P:
- Direct Connection: P2P establishes a direct link between two peers
- UDP Protocol: It primarily uses the User Datagram Protocol (UDP) for faster data transmission, crucial for real-time media
- Cost-Effective: With no server relaying the media, P2P costs almost nothing to operate; the only cost is the participants' own internet bandwidth
While excellent for one-on-one calls, P2P has significant limitations when it comes to multi-party conferences.
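To make this concrete, here is a minimal sketch of a one-to-one WebRTC connection in the browser. It assumes a module script (so top-level `await` is allowed), and `sendToSignalingServer()` is a hypothetical helper standing in for whatever signaling channel the app uses (WebSocket, HTTP, etc.), since WebRTC deliberately leaves signaling to the application.

```js
// Minimal one-to-one WebRTC setup in the browser.
const pc = new RTCPeerConnection({
  iceServers: [{ urls: 'stun:stun.l.google.com:19302' }],
});

// Capture the local camera/microphone and send it to the remote peer.
const localStream = await navigator.mediaDevices.getUserMedia({ audio: true, video: true });
localStream.getTracks().forEach((track) => pc.addTrack(track, localStream));

// Render whatever media the remote peer sends back.
pc.ontrack = (event) => {
  document.querySelector('#remoteVideo').srcObject = event.streams[0];
};

// ICE candidates and the offer are exchanged over the app's signaling channel.
pc.onicecandidate = (event) => {
  if (event.candidate) sendToSignalingServer({ candidate: event.candidate });
};
const offer = await pc.createOffer();
await pc.setLocalDescription(offer);
sendToSignalingServer({ offer });
```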
Mesh P2P: A Scalability Challenge
When more than two peers need to communicate using WebRTC, a common initial thought is Mesh P2P. In this model, every peer establishes a direct P2P connection with every other peer in the conference.
How Mesh P2P Works:
```
         Peer A
         /    \
        /      \
   Peer B ------ Peer C
        \        /
         \      /
          Peer D
```
However, Mesh P2P quickly becomes impractical and unscalable:
Problems with Mesh P2P:
| Participants | Connections per Peer | Total Streams |
| ------------ | -------------------- | ------------- |
| 3            | 2                    | 6             |
| 5            | 4                    | 20            |
| 10           | 9                    | 90            |
| 20           | 19                   | 380           |
- Connection Overload: As more peers join, the number of connections grows quadratically. With N participants, each peer must maintain N-1 connections, sending N-1 outgoing streams and receiving N-1 incoming streams, for N(N-1) streams in total (see the quick calculation after this list)
- High Client Load: Each client bears a heavy burden of encoding, decoding, and sending/receiving multiple streams simultaneously, leading to performance issues and frequent crashes
- Debugging Difficulty: The distributed nature of connections makes debugging and fault tolerance extremely challenging
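The quadratic growth is easy to verify with a couple of lines of JavaScript (a quick sanity check, not production code):

```js
// Connection and stream counts for a full mesh of N participants.
function meshLoad(n) {
  const connectionsPerPeer = n - 1;   // each peer links directly to every other peer
  const totalStreams = n * (n - 1);   // every peer sends one outgoing stream per connection
  return { connectionsPerPeer, totalStreams };
}

console.log(meshLoad(5));  // { connectionsPerPeer: 4, totalStreams: 20 }
console.log(meshLoad(20)); // { connectionsPerPeer: 19, totalStreams: 380 }
```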
This approach is simply not a scalable solution for applications like Google Meet or Zoom.
Centralized Solution #1: Multi-point Control Unit (MCU)
Recognizing the limitations of Mesh P2P, the industry introduced server-based solutions. One of the first was the Multi-point Control Unit (MCU).
How MCU Works:
In an MCU architecture, all participants send their audio and video streams to a central server. The MCU server then:
- Receives all individual audio and video streams
- Mixes all streams together
- Re-encodes into a single, combined audio/video stream
- Broadcasts the combined stream back to all participants
```
Peer A ──────┐
             │
Peer B ──────┼──► MCU Server ──► Mixed Stream ──► All Peers
             │    (Mix & Encode)
Peer C ──────┘
```
Advantages of MCU:
- Simplified Client-Side: Each client only needs to make one connection to the server
- Low Bandwidth for Clients: Clients receive only one combined stream regardless of participant count
- Consistent Experience: All clients receive the same quality stream
Disadvantages of MCU:
- CPU-Intensive Server: The major drawback is the immense processing power required to mix and re-encode multiple real-time video streams
- High Server Costs: Significant computational resources are needed
- Potential Latency: The mixing process can introduce noticeable lag
- No Individual Control: Clients cannot independently control individual streams (e.g., muting a specific person for yourself, pinning a speaker)
A rough analogy for MCU output is a live broadcast (for example, a YouTube live stream with several speakers): every viewer receives the same single, pre-combined feed.
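There is no "MCU in a few lines" API, but the mixing idea can be sketched in the browser with a `<canvas>`: draw every participant's `<video>` tile into one frame and capture the result as a single stream. A real MCU does the equivalent server-side with native media pipelines; this is only a toy illustration and assumes `<video class="participant">` elements already exist and are playing.

```js
// Toy "mixing": composite every participant's <video> tile onto one canvas
// and capture it as a single combined stream, like an MCU does server-side.
const canvas = document.createElement('canvas');
canvas.width = 1280;
canvas.height = 720;
const ctx = canvas.getContext('2d');
const tiles = document.querySelectorAll('video.participant'); // assumed tile elements

function drawMixedFrame() {
  const tileWidth = canvas.width / tiles.length;
  tiles.forEach((video, i) => {
    // Lay the participants out side by side in the combined frame.
    ctx.drawImage(video, i * tileWidth, 0, tileWidth, canvas.height);
  });
  requestAnimationFrame(drawMixedFrame);
}
drawMixedFrame();

// One mixed video stream, regardless of how many participants there are.
const mixedStream = canvas.captureStream(30);
```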
Centralized Solution #2: Selective Forwarding Unit (SFU)
The most popular and scalable architecture for multi-conference video calls today is the Selective Forwarding Unit (SFU). SFU addresses the limitations of both Mesh P2P and MCU.
How SFU Works:
In an SFU model, each peer sends their individual audio and video stream to a central SFU server. Unlike the MCU, the SFU server does not mix or re-encode these streams. Instead, it acts as a "pass-through" or "tunnel," selectively forwarding each peer's still-encoded stream to all other relevant peers (a toy model of this follows the diagram below).
```
Peer A ──────┐                      ┌──► Peer A (receives B, C streams)
             │                      │
Peer B ──────┼──►   SFU Server   ───┼──► Peer B (receives A, C streams)
             │    (Forward Only)    │
Peer C ──────┘                      └──► Peer C (receives A, B streams)
```
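The forwarding logic itself is conceptually simple. The toy model below (plain JavaScript, no real networking, all names invented for illustration) captures the essence: relay every incoming packet to all subscribed peers except the sender, without touching the media.

```js
// Toy model of selective forwarding (plain data structures, no networking):
// the "SFU" relays each packet to every peer except its sender, unmodified.
const peers = new Map(); // peerId -> { inbox: [] }

function join(peerId) {
  peers.set(peerId, { inbox: [] });
}

function forward(senderId, packet) {
  for (const [peerId, peer] of peers) {
    if (peerId !== senderId) {
      peer.inbox.push({ from: senderId, packet }); // no mixing, no re-encoding
    }
  }
}

['A', 'B', 'C'].forEach(join);
forward('A', 'video-packet-1');
console.log(peers.get('B').inbox); // [ { from: 'A', packet: 'video-packet-1' } ]
console.log(peers.get('A').inbox); // [] (senders do not get their own media back)
```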
Key Advantages of SFU:
| Feature         | MCU       | SFU       |
| --------------- | --------- | --------- |
| Server CPU Load | Very High | Low       |
| Client Control  | None      | Full      |
| Latency         | Higher    | Lower     |
| Scalability     | Limited   | Excellent |
| Stream Quality  | Uniform   | Variable  |
- Lower Server CPU Load: The SFU server doesn't perform computationally expensive mixing, significantly reducing its CPU footprint
- Client-Side Control: Because clients receive individual streams, they have the flexibility to:
  - Render each participant separately
  - Mute specific participants locally
  - Pin a specific person to full screen
  - Adjust individual stream quality
- Scalability: SFUs are highly scalable as they distribute the rendering burden to client devices
- Selective Forwarding: The "Selective" in SFU refers to the server's ability to decide which streams to forward to which participants based on their meeting context or specific needs
Why Google Meet and Zoom Use SFU:
This is why you can (see the browser-side sketch after this list):
- Pin a specific person to full screen
- Mute someone only for yourself
- See individual video tiles for each participant
- Have different quality streams based on your connection
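That client-side freedom falls out of the fact that each remote participant arrives over the SFU connection as a separate stream. A rough browser-side sketch, assuming `pc` is the `RTCPeerConnection` to the SFU and a `#grid` container element exists:

```js
// Each remote participant arrives from the SFU as its own stream, so the UI
// can control every tile independently.
const tiles = new Map(); // stream.id -> <video> element

pc.ontrack = (event) => {
  const stream = event.streams[0];
  let video = tiles.get(stream.id);
  if (!video) {
    video = document.createElement('video');
    video.autoplay = true;
    document.querySelector('#grid').appendChild(video);
    tiles.set(stream.id, video);
  }
  video.srcObject = stream;
};

// Mute one participant locally; nobody else is affected.
function muteLocally(streamId, muted) {
  tiles.get(streamId).muted = muted;
}

// "Pin" a participant by changing only your own layout.
function pinLocally(streamId) {
  tiles.get(streamId).classList.add('pinned');
}
```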
Comparison: P2P vs MCU vs SFU
| Aspect                    | Mesh P2P         | MCU          | SFU                |
| ------------------------- | ---------------- | ------------ | ------------------ |
| Server Required           | No               | Yes          | Yes                |
| Server CPU Usage          | N/A              | Very High    | Low                |
| Client CPU Usage          | Very High        | Low          | Medium             |
| Scalability               | Poor (3-5 users) | Moderate     | Excellent          |
| Latency                   | Lowest           | Highest      | Low                |
| Individual Stream Control | Yes              | No           | Yes                |
| Cost                      | Lowest           | Highest      | Moderate           |
| Best Use Case             | 1-on-1 calls     | Broadcasting | Video conferencing |
Implementing Your Own SFU
If you're looking to implement your own SFU, here are some excellent open-source libraries:
MediaSoup (Node.js)
MediaSoup is a low-level Selective Forwarding Unit framework designed to be integrated into Node.js servers. It's powerful but requires significant engineering effort.
```js
// Basic mediasoup setup example, wrapped in an async function so that
// `await` is valid in a CommonJS module.
const mediasoup = require('mediasoup');

async function startMediasoup() {
  // A worker is a separate process that handles the actual media traffic.
  const worker = await mediasoup.createWorker({
    logLevel: 'warn',
    rtcMinPort: 10000,
    rtcMaxPort: 10100,
  });

  // A router connects participants and defines the codecs the room supports.
  const router = await worker.createRouter({
    mediaCodecs: [
      {
        kind: 'audio',
        mimeType: 'audio/opus',
        clockRate: 48000,
        channels: 2,
      },
      {
        kind: 'video',
        mimeType: 'video/VP8',
        clockRate: 90000,
      },
    ],
  });

  return { worker, router };
}
```

Other SFU Libraries:
- Janus (C) — Full-featured WebRTC server
- Jitsi Videobridge (Java) — Powers Jitsi Meet
- LiveKit (Go) — Modern, scalable SFU
- Pion (Go) — Pure Go WebRTC implementation
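Whichever library you choose, the server-side flow looks similar: create a WebRTC transport per participant, accept their media as producers, and hand it to everyone else as consumers. Continuing the mediasoup example above, here is a rough sketch; the `rtpParameters`, `rtpCapabilities`, and `otherTransport` values would come from your own signaling layer, the IPs are placeholders, and exact option names should be checked against the mediasoup documentation.

```js
// Inside the same async setup as above: one WebRTC transport per participant.
const transport = await router.createWebRtcTransport({
  listenIps: [{ ip: '0.0.0.0', announcedIp: '203.0.113.10' }],
  enableUdp: true,
  enableTcp: true,
});

// A participant's stream enters the SFU as a "producer"...
const producer = await transport.produce({ kind: 'video', rtpParameters });

// ...and is forwarded to another participant as a "consumer",
// provided that participant's device supports the codec.
if (router.canConsume({ producerId: producer.id, rtpCapabilities })) {
  const consumer = await otherTransport.consume({
    producerId: producer.id,
    rtpCapabilities,
    paused: true, // resume once the client signals it is ready to receive
  });
}
```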
Advanced Considerations
Simulcast
Modern SFUs often implement Simulcast, where each client sends multiple quality versions of their stream (e.g., 720p, 360p, 180p); a client-side sketch follows this list. The SFU can then forward the appropriate quality to each recipient based on their:
- Network conditions
- Screen size
- Whether they're the active speaker
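On the client, simulcast is typically enabled through the transceiver's send encodings. A minimal browser-side sketch (module script assumed for top-level `await`; the `rid` names and bitrates are arbitrary choices, not values from the original article):

```js
// Advertise three simulcast layers of the same camera track; the SFU picks
// which layer to forward to each viewer.
const pc = new RTCPeerConnection();
const stream = await navigator.mediaDevices.getUserMedia({ video: true });

pc.addTransceiver(stream.getVideoTracks()[0], {
  direction: 'sendonly',
  sendEncodings: [
    { rid: 'low',  scaleResolutionDownBy: 4, maxBitrate: 150_000 },
    { rid: 'mid',  scaleResolutionDownBy: 2, maxBitrate: 500_000 },
    { rid: 'high', maxBitrate: 2_500_000 },
  ],
});
```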
SVC (Scalable Video Coding)
An alternative to Simulcast is SVC, which encodes video in layers that can be selectively decoded. This provides more flexibility but requires codec support.
Cascading SFUs
For global scale, multiple SFU servers can be cascaded across different regions, with servers forwarding streams to each other to minimize latency for geographically distributed participants.
```
                 ┌─── SFU (US-West) ◄── Users in California
                 │
Global Router ───┼─── SFU (US-East) ◄── Users in New York
                 │
                 └─── SFU (Europe) ◄── Users in London
```
Conclusion
Understanding the evolution from simple P2P to complex SFU architectures is crucial for appreciating the robustness of modern video conferencing applications:
- WebRTC/P2P: Great for one-on-one calls, but doesn't scale
- Mesh P2P: Theoretically works for groups, but practically fails beyond 3-5 participants
- MCU: Centralizes processing but is expensive and limits client control
- SFU: The sweet spot — scalable, flexible, and powers most modern conferencing apps
The SFU model stands out as the most efficient and flexible solution, enabling the seamless, interactive video call experiences we've come to expect from platforms like Google Meet and Zoom.
The next time you join a video call with colleagues spread across the globe, you'll know about the sophisticated dance of WebRTC streams, SFU servers, and selective forwarding happening behind the scenes — all orchestrated to make your meeting experience smooth and interactive.
