Background: Why AI Networking Needed a New Protocol
On May 5, 2026, OpenAI published a landmark engineering announcement: the release of Multipath Reliable Connection (MRC), a new open networking protocol co-developed with AMD, Broadcom, Intel, Microsoft, and NVIDIA. The release marks a pivotal moment in AI infrastructure engineering.
Training frontier AI models requires clusters of hundreds of thousands of GPUs working in tight synchronization. A single training step can involve many millions of data transfers, and one late transfer can stall an entire job, leaving thousands of expensive GPUs idle. Traditional Ethernet transports, specifically RoCEv2 (RDMA over Converged Ethernet v2), pin all traffic between two endpoints to a single fixed path. As clusters scale up, one congested link or failed switch along that path can bring an entire training run to a halt or force a costly restart from a saved checkpoint.
What Is MRC?
MRC stands for Multipath Reliable Connection. It is a new network transport protocol built into the latest 800 Gb/s network interfaces. MRC extends RoCEv2 and draws on techniques developed by the Ultra Ethernet Consortium (UEC), combining them with SRv6-based source routing to support large-scale AI networking fabrics. The result is a protocol that can spread a single transfer across hundreds of paths, route around failures in microseconds, and let the fabric run a much simpler network control plane.
MRC directly addresses two critical failure modes in large AI clusters: traffic congestion and link/switch failures. It is already deployed in production and has been used to train multiple OpenAI frontier models.
How MRC Works
MRC replaces single-path data transfer with intelligent multipath packet distribution. Key mechanisms include:
- Adaptive Packet Spraying: Instead of sending all packets along one path, MRC distributes them across multiple paths simultaneously, steered by live congestion feedback. This virtually eliminates core congestion and reduces GPU idle time during synchronized training steps (see the first sketch after this list).
- Multiplanar Network Design: Rather than treating one 800 Gb/s interface as a single link, MRC splits it into multiple smaller links—for example, eight parallel 100 Gb/s networks (planes). Each plane provides a complete east-west path between all GPUs, delivering redundancy and boosting switch radix efficiency.
- Microsecond Path Failover: When MRC detects packet loss on a path, it immediately stops using that path and reroutes traffic. Training jobs can survive link flaps and even live switch reboots without measurable disruption—previously, a single failure would crash an entire job.
- Packet Trimming: When a switch would drop a packet due to buffer pressure, MRC trims the payload and forwards only the header to the destination. This triggers an explicit retransmission request and avoids false-positive path failure assumptions.
- Static Source Routing (SRv6): OpenAI eliminated dynamic routing protocols such as BGP in favor of IPv6 Segment Routing. The sender encodes the full route, including the identifiers of the switches to traverse, directly into the packet's IPv6 destination address, eliminating entire classes of routing failures (see the second sketch after this list).
- High-Frequency Telemetry: MRC includes continuous reporting of network conditions such as congestion signals, packet loss, and path utilization, enabling real-time microsecond-level routing decisions.
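Taken together, the spraying, failover, trimming, and telemetry mechanisms amount to a sender-side control loop. The following is a minimal, illustrative Python sketch of such a loop; it is not OpenAI's implementation (which lives in NIC hardware and firmware), and every class name, method name, and threshold here is hypothetical.

```python
import random

class Path:
    """One candidate route (for example, one plane) between two NICs."""
    def __init__(self, path_id):
        self.path_id = path_id
        self.congestion = 0.0   # latest telemetry signal: 0 = idle, 1 = saturated
        self.healthy = True     # cleared when silent loss is detected on this path

class MultipathSender:
    """Toy sender: spray packets over healthy paths, weighted by telemetry."""
    def __init__(self, paths):
        self.paths = paths
        self.inflight = {}      # sequence number -> (payload, path) awaiting ack

    def pick_path(self):
        candidates = [p for p in self.paths if p.healthy]
        weights = [max(1e-3, 1.0 - p.congestion) for p in candidates]
        return random.choices(candidates, weights)[0]

    def send(self, seq, payload):
        path = self.pick_path()
        self.inflight[seq] = (payload, path)
        # transmit(payload, via=path)  # placeholder for the real NIC transmit

    def on_trimmed_header(self, seq):
        # A switch dropped the payload but forwarded the trimmed header:
        # the path is congested, not dead, so retransmit on another path
        # without marking anything as failed.
        payload, _ = self.inflight[seq]
        self.send(seq, payload)

    def on_loss_detected(self, seq):
        # Silent loss with no trimmed header: assume the path has failed,
        # stop using it immediately, and reroute the packet elsewhere.
        payload, path = self.inflight[seq]
        path.healthy = False
        self.send(seq, payload)

    def on_telemetry(self, path_id, congestion):
        # High-frequency congestion reports steer future spraying decisions.
        self.paths[path_id].congestion = congestion  # path_id doubles as list index here
```

In real hardware this loop runs per packet at line rate, which is why the spraying and failover decisions must be made in microseconds rather than by a software control plane.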
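For the static source routing item, the public materials do not specify MRC's exact encoding. The sketch below shows one way a sender could pack a route into a single IPv6 destination address in the style of SRv6 micro-segments (uSIDs); the block prefix and switch identifiers are invented for illustration.

```python
import ipaddress

# Hypothetical 32-bit uSID block prefix; a real deployment chooses its own.
USID_BLOCK = 0xFC000001

def encode_source_route(switch_ids):
    """Pack the block prefix plus up to six 16-bit switch IDs into one IPv6
    address, in the spirit of SRv6 micro-segments. Unused slots stay zero."""
    if len(switch_ids) > 6:
        raise ValueError("at most six 16-bit segments fit after a 32-bit block")
    addr = USID_BLOCK << 96
    shift = 80
    for sid in switch_ids:
        addr |= (sid & 0xFFFF) << shift
        shift -= 16
    return ipaddress.IPv6Address(addr)

# Example: traverse leaf 0x0012, spine 0x0301, leaf 0x0045, then NIC 0x0a07.
print(encode_source_route([0x0012, 0x0301, 0x0045, 0x0A07]))
# -> fc00:1:12:301:45:a07::
```

Because the sender fixes the whole route up front, each switch only has to match its own identifier and advance to the next one, which is what lets the fabric drop dynamic routing protocols from the forwarding path.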
A key architectural advantage: MRC's multiplanar, multipath design lets a two-tier Ethernet switch topology connect more than 100,000 GPUs, a scale that conventional 800 Gb/s networks need three or four switch tiers to reach. Fewer tiers mean lower power consumption, fewer components, and lower network cost at scale.
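To see why the tier count drops, here is the rough radix arithmetic, assuming (purely for illustration) 51.2 Tb/s switch ASICs; the announcement does not publish OpenAI's exact switch configuration.

```python
# A non-blocking two-tier (leaf-spine) fabric supports about radix^2 / 2 endpoints:
# each leaf uses half its ports for hosts and half for uplinks to the spines.
def two_tier_endpoints(radix):
    return radix * radix // 2

PORTS_800G = 64    # one 51.2 Tb/s ASIC exposed as 64 x 800 Gb/s ports
PORTS_100G = 512   # the same ASIC split into 512 x 100 Gb/s ports (one plane)

print(two_tier_endpoints(PORTS_800G))   # 2048   -> far short of 100,000 GPUs
print(two_tier_endpoints(PORTS_100G))   # 131072 -> endpoints per 100 Gb/s plane
```

Because every GPU attaches one 100 Gb/s lane to each of the eight parallel planes, the planes scale together: roughly 131,000 GPUs fit in two tiers, versus about 2,000 with monolithic 800 Gb/s links, which is why conventional designs add a third or fourth tier to reach the same scale.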
Production Deployments
MRC is not theoretical. It is deployed across all of OpenAI's largest NVIDIA GB200 supercomputers used to train frontier models.
