MRC Protocol: How OpenAI’s Open Network Standard Saves GPU Clusters

MRC protocol sprays packets across hundreds of parallel paths, eliminating single-point failures in 100,000-GPU training clusters

When one switch fails inside a 100,000-GPU training cluster, the entire job stalls. At $20–30 million per week in compute costs, “stall” is a polite word. OpenAI, AMD, Broadcom, Intel, Microsoft, and NVIDIA spent two years building a fix: Multipath Reliable Connection (MRC), a new open networking protocol that sprays packets across hundreds of paths simultaneously and reroutes around failures in microseconds. It is live in OpenAI’s largest GB200 supercomputers and now available as an open standard through the Open Compute Project for anyone to implement.

The Problem Single-Path RDMA Creates at Scale

Traditional RoCEv2 — the RDMA-over-Converged-Ethernet standard running most high-performance AI clusters today — binds each queue pair to a single network path. That works fine at a few thousand GPUs. At 100,000, it breaks down fast.

A congested link or a flapping interface does not just slow that one connection. It stalls the entire synchronous training job while the network reconverges, a process that takes seconds with traditional routing protocols like OSPF or BGP. The measured cost: a 30% performance hit during congestion events. Against a $20–30 million weekly compute bill, a 10% improvement in cluster utilization at Stargate scale is worth $2–3 million per week in compute savings. The networking stack became the hidden ceiling on training efficiency.
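The savings figure follows directly from the weekly compute bill. A minimal sketch of that arithmetic, using only the dollar ranges quoted in this article (the function name is hypothetical, and treating the value of utilization as linear in cost is an assumption):

```python
def utilization_gain_value(weekly_cost_usd: int, gain_percent: int) -> float:
    """Weekly dollar value of recovering `gain_percent` of cluster utilization.

    Assumes value scales linearly with the weekly compute bill.
    """
    return weekly_cost_usd * gain_percent / 100


# The article's range: $20-30 million per week, 10% utilization improvement.
for weekly_cost in (20_000_000, 30_000_000):
    value = utilization_gain_value(weekly_cost, 10)
    print(f"${weekly_cost:,}/week at +10% utilization -> ${value:,.0f}/week")
```

The same function applied to the 30% slowdown figure gives a rough upper bound on what a sustained congestion event would cost over a full week.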

What MRC Actually Does

MRC extends the RoCE transport protocol with two core additions: multipath packet spraying and SRv6 source routing.

Packet spraying means MRC takes the packets from a single data transfer and fans them across hundreds of network paths, spanning every available plane, instead of routing them down a single pipe. If one path gets congested or a link fails, packets travelling on the other paths keep arriving. No stall. No reconvergence wait. Failure detection and rerouting happen in microseconds.
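To make the idea concrete, here is a toy sketch of spraying with failure avoidance. It is not MRC's actual path-selection algorithm, which this article does not describe: the `Sprayer` class, the round-robin policy, and the integer path IDs are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Sprayer:
    """Toy sender that spreads packets round-robin over healthy paths."""
    num_paths: int
    failed: set = field(default_factory=set)  # path IDs currently marked bad
    _next: int = 0                            # round-robin cursor

    def mark_failed(self, path: int) -> None:
        self.failed.add(path)

    def pick_path(self) -> int:
        # Advance the cursor until a healthy path is found; give up after
        # one full lap so a fully failed fabric raises instead of looping.
        for _ in range(self.num_paths):
            path = self._next
            self._next = (self._next + 1) % self.num_paths
            if path not in self.failed:
                return path
        raise RuntimeError("no healthy paths left")


def spray(sprayer: Sprayer, num_packets: int) -> list:
    """Return the path chosen for each packet of one transfer."""
    return [sprayer.pick_path() for _ in range(num_packets)]


sender = Sprayer(num_paths=4)
print(spray(sender, 8))   # [0, 1, 2, 3, 0, 1, 2, 3]
sender.mark_failed(2)     # e.g. a link on path 2 starts flapping
print(spray(sender, 6))   # [0, 1, 3, 0, 1, 3]
```

The point of the sketch is that losing one path only shrinks the set the sender draws from; the transfer keeps moving on the rest without waiting for the network to reconverge.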

SRv6 (Segment Routing over IPv6) is the mechanism that makes this work at scale. Each packet carries the full path specification embedded in its header: a sequence of switch identifiers the packet must traverse. The sender picks the path; the network follows instructions. No dynamic routing decisions at intermediate hops, no shared state, no convergence delay. Each MRC packet also carries its final memory address, so the receiving NIC can write every packet straight to its correct location in memory even when packets arrive out of order via different routes.
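The two ideas in this paragraph, a sender-chosen segment list and a destination address in every packet, can be modelled in a few lines. This is a simplified illustration, not the MRC wire format: the `Packet` fields, the switch names, and the use of a byte offset in place of a real memory address are assumptions.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    segments: tuple   # ordered switch identifiers chosen by the sender
    offset: int       # where the payload belongs in the destination buffer
    payload: bytes


def packetize(message: bytes, mtu: int, paths: list) -> list:
    """Split a message into packets, assigning paths round-robin."""
    packets = []
    for i, start in enumerate(range(0, len(message), mtu)):
        packets.append(
            Packet(
                segments=paths[i % len(paths)],
                offset=start,
                payload=message[start:start + mtu],
            )
        )
    return packets


def receive(packets: list, total_len: int) -> bytes:
    """Place each payload at its offset; arrival order does not matter."""
    buffer = bytearray(total_len)
    for pkt in packets:
        buffer[pkt.offset:pkt.offset + len(pkt.payload)] = pkt.payload
    return bytes(buffer)


# Hypothetical two-path fabric: each path is a list of switch IDs.
paths = [("leaf1", "spine1", "leaf9"), ("leaf2", "spine7", "leaf9")]
message = bytes(range(100))

packets = packetize(message, mtu=16, paths=paths)
random.shuffle(packets)  # simulate out-of-order arrival across paths
assert receive(packets, len(message)) == message
```

Because every packet says where it belongs, the receiver needs no reorder buffer; it simply writes each payload where the header points.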

The result, per OpenAI’s published research: training runs continue without measurable disruption during link flap events that previously caused 30% slowdowns.

Eight Planes, Two Tiers, Commodity Hardware

The architectural choice driving MRC’s cost story is the multi-plane topology. An 800Gb/s NIC gets split into eight 100Gb/s connections, each going to a different top-of-rack switch — eight independent planes. This lets a 100,000+ GPU cluster operate on only two tiers of switches instead of the traditional three or four. Fewer tiers means fewer switches, simpler operations, and fewer failure points.
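A back-of-the-envelope calculation shows why splitting the NIC matters for tier count. The sketch below assumes a 51.2 Tb/s switch ASIC (the Tomahawk 5 class) and a non-blocking two-tier leaf-spine design in which each leaf uses half its ports for hosts and half for uplinks; neither the function names nor these design assumptions come from the MRC specification.

```python
def switch_radix(switch_gbps: int, port_gbps: int) -> int:
    """Number of ports a switch offers when carved into equal-speed ports."""
    return switch_gbps // port_gbps


def two_tier_endpoints(radix: int) -> int:
    """Max endpoints in a non-blocking two-tier leaf-spine fabric.

    Each leaf dedicates radix/2 ports to hosts and radix/2 to spines.
    Each spine has `radix` ports, so at most `radix` leaves can attach.
    """
    hosts_per_leaf = radix // 2
    max_leaves = radix
    return hosts_per_leaf * max_leaves


SWITCH_GBPS = 51_200  # assumed 51.2 Tb/s switch capacity

for port_gbps in (800, 100):
    radix = switch_radix(SWITCH_GBPS, port_gbps)
    print(f"{port_gbps}G ports: radix {radix}, "
          f"{two_tier_endpoints(radix):,} endpoints per plane")
# 800G ports: radix 64, 2,048 endpoints per plane
# 100G ports: radix 512, 131,072 endpoints per plane
```

Under these assumptions a single plane of 100Gb/s ports reaches past 100,000 endpoints in two tiers, while the same silicon carved into 800Gb/s ports would need additional tiers. Running eight such planes in parallel restores the full 800Gb/s per NIC.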

The hardware list is deliberately non-proprietary. MRC runs on NVIDIA ConnectX-8, AMD Pollara, AMD Vulcano, and Broadcom Thor Ultra NICs. Switch support comes from NVIDIA Spectrum-4, Spectrum-5, and Broadcom Tomahawk 5. This is standard Ethernet silicon, not specialized InfiniBand gear. Any vendor with a compatible NIC or switch can implement MRC now that the OCP specification is public. Both Broadcom and NVIDIA have published implementation details for their respective hardware.

Why Competitors Collaborated on a Single Open Spec

The OCP release is the signal worth paying attention to. The Open Compute Project is where hyperscalers go when they have decided a problem is better solved as shared infrastructure than as competitive differentiation. Open rack hardware, open BMC firmware, open networking specs — OCP contributions consistently mark the point where the industry decides to stop fighting over plumbing.

MRC’s co-developer list — AMD, Broadcom, Intel, Microsoft, NVIDIA — is not a consortium of friends. These companies compete directly on silicon, cloud infrastructure, and AI tooling. When they co-develop and publish a shared networking spec, they are signaling that AI fabric reliability has joined the list of problems too important and too costly to solve in isolation.

Where This Leaves Ethernet vs. InfiniBand

InfiniBand’s advantage over Ethernet for AI training has historically rested on two pillars: lower latency (1–2µs vs. Ethernet’s 5–10µs with RoCE) and native multipath reliability. MRC directly answers the second, and tuned RoCE on modern 800Gb/s hardware narrows the first. Meanwhile, InfiniBand still carries a 30–60% cost premium over equivalent Ethernet capacity.

MRC does not kill InfiniBand. But it removes the clearest technical argument for paying the premium. For new clusters being planned today — especially anything at 50,000 GPUs or above — the calculus has shifted. Ethernet with MRC now offers commodity pricing, an open specification, and production-proven reliability at the scale that matters most.

The OCP spec is public. The hardware is shipping. For infrastructure teams and cloud engineers working at AI scale, the question now is which hyperscalers and on-premises cluster operators move next — and how quickly MRC support arrives in the GPU instance families developers already depend on.

ByteBot
