
System Design Behind Multi-Conference Video Calls
Dive deep into the architectural patterns powering popular video conferencing apps like Google Meet and Zoom. Learn about WebRTC, Peer-to-Peer (P2P), Mesh P2P, Multi-point Control Unit (MCU), and the highly scalable Selective Forwarding Unit (SFU) architectures.
Introduction
Video conferencing has become an integral part of our daily lives, from team meetings on Google Meet to family calls on Zoom. But have you ever wondered about the complex system design that makes these real-time, multi-party interactions possible? Let's dive deep into the architectural patterns behind these applications, exploring WebRTC, Peer-to-Peer (P2P), Multi-point Control Unit (MCU), and the Selective Forwarding Unit (SFU).
What are Multi-Conference Video Applications?
Simply put, a multi-conference video application allows more than two people to engage in a real-time video and audio call. Think of platforms like:
- Zoom
- Google Meet
- Microsoft Teams
- Discord
Designing such systems is no trivial task, as they demand high performance for real-time audio and video transmission and robust architectural patterns to handle scalability.
The Foundation: WebRTC and Peer-to-Peer (P2P)
At the heart of many modern real-time communication systems is WebRTC (Web Real-Time Communications). WebRTC enables direct Peer-to-Peer (P2P) communication, allowing two users (or "peers") to send and receive audio and video streams directly without an intermediary server.
Key Characteristics of P2P:
- Direct Connection: P2P establishes a direct link between two peers
- UDP Protocol: It primarily uses the User Datagram Protocol (UDP) for faster data transmission, crucial for real-time media
- Cost-Effective: With no server relaying the media, P2P costs almost nothing to operate; the only cost is the participants' own internet bandwidth
While excellent for one-on-one calls, P2P has significant limitations when it comes to multi-party conferences.
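To make this concrete, here is a minimal sketch of a one-to-one WebRTC connection in the browser. It assumes a module script (so top-level `await` is allowed), and `sendToSignalingServer()` is a hypothetical helper standing in for whatever signaling channel the app uses (WebSocket, HTTP, etc.), since WebRTC deliberately leaves signaling to the application.

```js
// Minimal one-to-one WebRTC setup in the browser.
const pc = new RTCPeerConnection({
  iceServers: [{ urls: 'stun:stun.l.google.com:19302' }],
});

// Capture the local camera/microphone and send it to the remote peer.
const localStream = await navigator.mediaDevices.getUserMedia({ audio: true, video: true });
localStream.getTracks().forEach((track) => pc.addTrack(track, localStream));

// Render whatever media the remote peer sends back.
pc.ontrack = (event) => {
  document.querySelector('#remoteVideo').srcObject = event.streams[0];
};

// ICE candidates and the offer are exchanged over the app's signaling channel.
pc.onicecandidate = (event) => {
  if (event.candidate) sendToSignalingServer({ candidate: event.candidate });
};
const offer = await pc.createOffer();
await pc.setLocalDescription(offer);
sendToSignalingServer({ offer });
```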
Mesh P2P: A Scalability Challenge
When more than two peers need to communicate using WebRTC, a common initial thought is Mesh P2P. In this model, every peer establishes a direct P2P connection with every other peer in the conference.
How Mesh P2P Works:
```
         Peer A
         /    \
        /      \
   Peer B ------ Peer C
        \        /
         \      /
          Peer D
```
However, Mesh P2P quickly becomes impractical and unscalable:
Problems with Mesh P2P:
| Participants | Connections per Peer | Total Streams |
| ------------ | -------------------- | ------------- |
| 3            | 2                    | 6             |
| 5            | 4                    | 20            |
| 10           | 9                    | 90            |
| 20           | 19                   | 380           |
- Connection Overload: As more peers join, the number of connections grows quadratically. With N participants, each peer must maintain N-1 connections, sending N-1 outgoing streams and receiving N-1 incoming streams, for N(N-1) streams in total (see the quick calculation after this list)
- High Client Load: Each client bears a heavy burden of encoding, decoding, and sending/receiving multiple streams simultaneously, leading to performance issues and frequent crashes
- Debugging Difficulty: The distributed nature of connections makes debugging and fault tolerance extremely challenging
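The quadratic growth is easy to verify with a couple of lines of JavaScript (a quick sanity check, not production code):

```js
// Connection and stream counts for a full mesh of N participants.
function meshLoad(n) {
  const connectionsPerPeer = n - 1;   // each peer links directly to every other peer
  const totalStreams = n * (n - 1);   // every peer sends one outgoing stream per connection
  return { connectionsPerPeer, totalStreams };
}

console.log(meshLoad(5));  // { connectionsPerPeer: 4, totalStreams: 20 }
console.log(meshLoad(20)); // { connectionsPerPeer: 19, totalStreams: 380 }
```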
This approach is simply not a scalable solution for applications like Google Meet or Zoom.
Centralized Solution #1: Multi-point Control Unit (MCU)
Recognizing the limitations of Mesh P2P, the industry introduced server-based solutions. One of the first was the Multi-point Control Unit (MCU).
How MCU Works:
In an MCU architecture, all participants send their audio and video streams to a central server. The MCU server then:
- Receives all individual audio and video streams
- Mixes all streams together
- Re-encodes into a single, combined audio/video stream
- Broadcasts the combined stream back to all participants
```
Peer A ──────┐
             │
Peer B ──────┼──► MCU Server ──► Mixed Stream ──► All Peers
             │    (Mix & Encode)
Peer C ──────┘
```
Advantages of MCU:
- Simplified Client-Side: Each client only needs to make one connection to the server
- Low Bandwidth for Clients: Clients receive only one combined stream regardless of participant count
- Consistent Experience: All clients receive the same quality stream
Disadvantages of MCU:
- CPU-Intensive Server: The major drawback is the immense processing power required to mix and re-encode multiple real-time video streams
- High Server Costs: Significant computational resources are needed
- Potential Latency: The mixing process can introduce noticeable lag
- No Individual Control: Clients cannot independently control individual streams (e.g., muting a specific person for yourself, pinning a speaker)
A rough analogy for MCU output is a live broadcast (for example, a YouTube live stream with several speakers): every viewer receives the same single, pre-combined feed.
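There is no "MCU in a few lines" API, but the mixing idea can be sketched in the browser with a `<canvas>`: draw every participant's `<video>` tile into one frame and capture the result as a single stream. A real MCU does the equivalent server-side with native media pipelines; this is only a toy illustration and assumes `<video class="participant">` elements already exist and are playing.

```js
// Toy "mixing": composite every participant's <video> tile onto one canvas
// and capture it as a single combined stream, like an MCU does server-side.
const canvas = document.createElement('canvas');
canvas.width = 1280;
canvas.height = 720;
const ctx = canvas.getContext('2d');
const tiles = document.querySelectorAll('video.participant'); // assumed tile elements

function drawMixedFrame() {
  const tileWidth = canvas.width / tiles.length;
  tiles.forEach((video, i) => {
    // Lay the participants out side by side in the combined frame.
    ctx.drawImage(video, i * tileWidth, 0, tileWidth, canvas.height);
  });
  requestAnimationFrame(drawMixedFrame);
}
drawMixedFrame();

// One mixed video stream, regardless of how many participants there are.
const mixedStream = canvas.captureStream(30);
```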
Centralized Solution #2: Selective Forwarding Unit (SFU)
The most popular and scalable architecture for multi-conference video calls today is the Selective Forwarding Unit (SFU). SFU addresses the limitations of both Mesh P2P and MCU.
How SFU Works:
In an SFU model, each peer sends their individual audio and video stream to a central SFU server. Unlike the MCU, the SFU server does not mix or re-encode these streams. Instead, it acts as a "pass-through" or "tunnel," selectively forwarding each peer's still-encoded stream to all other relevant peers (a toy model of this follows the diagram below).
```
Peer A ──────┐                      ┌──► Peer A (receives B, C streams)
             │                      │
Peer B ──────┼──►   SFU Server   ───┼──► Peer B (receives A, C streams)
             │    (Forward Only)    │
Peer C ──────┘                      └──► Peer C (receives A, B streams)
```
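The forwarding logic itself is conceptually simple. The toy model below (plain JavaScript, no real networking, all names invented for illustration) captures the essence: relay every incoming packet to all subscribed peers except the sender, without touching the media.

```js
// Toy model of selective forwarding (plain data structures, no networking):
// the "SFU" relays each packet to every peer except its sender, unmodified.
const peers = new Map(); // peerId -> { inbox: [] }

function join(peerId) {
  peers.set(peerId, { inbox: [] });
}

function forward(senderId, packet) {
  for (const [peerId, peer] of peers) {
    if (peerId !== senderId) {
      peer.inbox.push({ from: senderId, packet }); // no mixing, no re-encoding
    }
  }
}

['A', 'B', 'C'].forEach(join);
forward('A', 'video-packet-1');
console.log(peers.get('B').inbox); // [ { from: 'A', packet: 'video-packet-1' } ]
console.log(peers.get('A').inbox); // [] (senders do not get their own media back)
```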
Key Advantages of SFU:
| Feature         | MCU       | SFU       |
| --------------- | --------- | --------- |
| Server CPU Load | Very High | Low       |
| Client Control  | None      | Full      |
| Latency         | Higher    | Lower     |
| Scalability     | Limited   | Excellent |
| Stream Quality  | Uniform   | Variable  |
- Lower Server CPU Load: The SFU server doesn't perform computationally expensive mixing, significantly reducing its CPU footprint
- Client-Side Control: Because clients receive individual streams, they have the flexibility to:
  - Render each participant separately
  - Mute specific participants locally
  - Pin a specific person to full screen
  - Adjust individual stream quality
- Scalability: SFUs are highly scalable as they distribute the rendering burden to client devices
- Selective Forwarding: The "Selective" in SFU refers to the server's ability to decide which streams to forward to which participants based on their meeting context or specific needs
Why Google Meet and Zoom Use SFU:
This is why you can (see the browser-side sketch after this list):
- Pin a specific person to full screen
- Mute someone only for yourself
- See individual video tiles for each participant
- Have different quality streams based on your connection
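That client-side freedom falls out of the fact that each remote participant arrives over the SFU connection as a separate stream. A rough browser-side sketch, assuming `pc` is the `RTCPeerConnection` to the SFU and a `#grid` container element exists:

```js
// Each remote participant arrives from the SFU as its own stream, so the UI
// can control every tile independently.
const tiles = new Map(); // stream.id -> <video> element

pc.ontrack = (event) => {
  const stream = event.streams[0];
  let video = tiles.get(stream.id);
  if (!video) {
    video = document.createElement('video');
    video.autoplay = true;
    document.querySelector('#grid').appendChild(video);
    tiles.set(stream.id, video);
  }
  video.srcObject = stream;
};

// Mute one participant locally; nobody else is affected.
function muteLocally(streamId, muted) {
  tiles.get(streamId).muted = muted;
}

// "Pin" a participant by changing only your own layout.
function pinLocally(streamId) {
  tiles.get(streamId).classList.add('pinned');
}
```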
Comparison: P2P vs MCU vs SFU
| Aspect                    | Mesh P2P         | MCU          | SFU                |
| ------------------------- | ---------------- | ------------ | ------------------ |
| Server Required           | No               | Yes          | Yes                |
| Server CPU Usage          | N/A              | Very High    | Low                |
| Client CPU Usage          | Very High        | Low          | Medium             |
| Scalability               | Poor (3-5 users) | Moderate     | Excellent          |
| Latency                   | Lowest           | Highest      | Low                |
| Individual Stream Control | Yes              | No           | Yes                |
| Cost                      | Lowest           | Highest      | Moderate           |
| Best Use Case             | 1-on-1 calls     | Broadcasting | Video conferencing |
Implementing Your Own SFU
If you're looking to implement your own SFU, here are some excellent open-source libraries:
MediaSoup (Node.js)
MediaSoup is a low-level Selective Forwarding Unit framework designed to be integrated into Node.js servers. It's powerful but requires significant engineering effort.
```js
// Basic mediasoup setup example, wrapped in an async function so that
// `await` is valid in a CommonJS module.
const mediasoup = require('mediasoup');

async function startMediasoup() {
  // A worker is a separate process that handles the actual media traffic.
  const worker = await mediasoup.createWorker({
    logLevel: 'warn',
    rtcMinPort: 10000,
    rtcMaxPort: 10100,
  });

  // A router connects participants and defines the codecs the room supports.
  const router = await worker.createRouter({
    mediaCodecs: [
      {
        kind: 'audio',
        mimeType: 'audio/opus',
        clockRate: 48000,
        channels: 2,
      },
      {
        kind: 'video',
        mimeType: 'video/VP8',
        clockRate: 90000,
      },
    ],
  });

  return { worker, router };
}
```

Other SFU Libraries:
- Janus (C) — Full-featured WebRTC server
- Jitsi Videobridge (Java) — Powers Jitsi Meet
- LiveKit (Go) — Modern, scalable SFU
- Pion (Go) — Pure Go WebRTC implementation
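Whichever library you choose, the server-side flow looks similar: create a WebRTC transport per participant, accept their media as producers, and hand it to everyone else as consumers. Continuing the mediasoup example above, here is a rough sketch; the `rtpParameters`, `rtpCapabilities`, and `otherTransport` values would come from your own signaling layer, the IPs are placeholders, and exact option names should be checked against the mediasoup documentation.

```js
// Inside the same async setup as above: one WebRTC transport per participant.
const transport = await router.createWebRtcTransport({
  listenIps: [{ ip: '0.0.0.0', announcedIp: '203.0.113.10' }],
  enableUdp: true,
  enableTcp: true,
});

// A participant's stream enters the SFU as a "producer"...
const producer = await transport.produce({ kind: 'video', rtpParameters });

// ...and is forwarded to another participant as a "consumer",
// provided that participant's device supports the codec.
if (router.canConsume({ producerId: producer.id, rtpCapabilities })) {
  const consumer = await otherTransport.consume({
    producerId: producer.id,
    rtpCapabilities,
    paused: true, // resume once the client signals it is ready to receive
  });
}
```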
Advanced Considerations
Simulcast
Modern SFUs often implement Simulcast, where each client sends multiple quality versions of their stream (e.g., 720p, 360p, 180p); a client-side sketch follows this list. The SFU can then forward the appropriate quality to each recipient based on their:
- Network conditions
- Screen size
- Whether they're the active speaker
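On the client, simulcast is typically enabled through the transceiver's send encodings. A minimal browser-side sketch (module script assumed for top-level `await`; the `rid` names and bitrates are arbitrary choices, not values from the original article):

```js
// Advertise three simulcast layers of the same camera track; the SFU picks
// which layer to forward to each viewer.
const pc = new RTCPeerConnection();
const stream = await navigator.mediaDevices.getUserMedia({ video: true });

pc.addTransceiver(stream.getVideoTracks()[0], {
  direction: 'sendonly',
  sendEncodings: [
    { rid: 'low',  scaleResolutionDownBy: 4, maxBitrate: 150_000 },
    { rid: 'mid',  scaleResolutionDownBy: 2, maxBitrate: 500_000 },
    { rid: 'high', maxBitrate: 2_500_000 },
  ],
});
```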
SVC (Scalable Video Coding)
An alternative to Simulcast is SVC, which encodes video in layers that can be selectively decoded. This provides more flexibility but requires codec support.
Cascading SFUs
For global scale, multiple SFU servers can be cascaded across different regions, with servers forwarding streams to each other to minimize latency for geographically distributed participants.
```
                 ┌─── SFU (US-West) ◄── Users in California
                 │
Global Router ───┼─── SFU (US-East) ◄── Users in New York
                 │
                 └─── SFU (Europe) ◄── Users in London
```
Conclusion
Understanding the evolution from simple P2P to complex SFU architectures is crucial for appreciating the robustness of modern video conferencing applications:
- WebRTC/P2P: Great for one-on-one calls, but doesn't scale
- Mesh P2P: Theoretically works for groups, but practically fails beyond 3-5 participants
- MCU: Centralizes processing but is expensive and limits client control
- SFU: The sweet spot — scalable, flexible, and powers most modern conferencing apps
The SFU model stands out as the most efficient and flexible solution, enabling the seamless, interactive video call experiences we've come to expect from platforms like Google Meet and Zoom.
The next time you join a video call with colleagues spread across the globe, you'll know about the sophisticated dance of WebRTC streams, SFU servers, and selective forwarding happening behind the scenes — all orchestrated to make your meeting experience smooth and interactive.
