
OpenAI MRC Protocol: Six Tech Giants Fix GPU Network Failures

On May 5, 2026, OpenAI announced MRC (Multipath Reliable Connection) in collaboration with AMD, Broadcom, Intel, Microsoft, and NVIDIA: an unprecedented alliance of competitors tackling a billion-dollar problem. The open-source protocol addresses the network failures that waste GPU compute time in large-scale AI training clusters. MRC spreads data across hundreds of parallel paths and routes around failures in microseconds, so thousands of GPUs no longer sit idle when a link goes down. The protocol is already deployed in production on OpenAI’s largest NVIDIA GB200 supercomputers, where it’s training GPT-5.5 and other frontier models.

The Billion-Dollar Problem: Network Failures at Scale

At 100,000-GPU scale—the frontier OpenAI, Microsoft, and others are pushing toward—network failures aren’t edge cases. They’re mathematically certain. With a 0.5% daily failure rate, that’s 500 GPUs failing every 24 hours. As Greg Brockman, OpenAI’s President, put it: “At Stargate scale, a single slow link can idle thousands of GPUs simultaneously.”
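
The arithmetic is easy to check. Here is a back-of-the-envelope sketch in Python using the figures above; the independence assumption and the 30-day run length are illustrative, not from the announcement:

```python
# Back-of-the-envelope failure math using the figures quoted above.
# Assumes independent failures at a constant daily rate (a simplification).
cluster_size = 100_000        # GPUs in a frontier-scale cluster
daily_failure_rate = 0.005    # 0.5% per day

expected_per_day = cluster_size * daily_failure_rate
print(f"Expected failures per day: {expected_per_day:.0f}")   # 500

# Chance that a hypothetical 30-day run sees at least one failure:
run_days = 30
p_zero_failures = (1 - daily_failure_rate) ** (run_days * cluster_size)
print(f"P(zero failures in 30 days): {p_zero_failures:.3e}")  # underflows to 0
```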

Traditional networks take 10 to 60 seconds to detect failures and reroute traffic via BGP or OSPF convergence. During that window, thousands of synchronized GPUs sit idle, waiting. AI training relies on collective operations where every GPU must stay in lockstep: if one is delayed, all of them wait. And because multi-week or multi-month training runs guarantee multiple failures will occur, the idle time adds up to millions of dollars in wasted compute annually.
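
To put numbers on that window, here is a rough comparison of idle GPU-time per failure event; the 30-second figure is the mid-range of the convergence window above, and the 50-microsecond failover is an assumed order of magnitude, not a published MRC number:

```python
# Rough idle-time cost of one failure event during synchronized training.
# Collectives keep every GPU in lockstep, so one stalled link blocks all.
gpus_blocked = 100_000
bgp_convergence_s = 30.0     # mid-range of the 10-60 s quoted above
mrc_failover_s = 50e-6       # assumed tens-of-microseconds failover

idle_bgp = gpus_blocked * bgp_convergence_s
idle_mrc = gpus_blocked * mrc_failover_s
print(f"BGP/OSPF reconvergence: {idle_bgp:,.0f} idle GPU-seconds per event")
print(f"MRC path failover:      {idle_mrc:,.1f} idle GPU-seconds per event")
# 3,000,000 vs 5.0: roughly six orders of magnitude less wasted compute.
```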

This isn’t theoretical. It’s the reality of frontier AI infrastructure, and it’s why six competitors agreed to collaborate on a solution.

How MRC Works: Multipath and Microsecond Failover

MRC implements “packet spraying”—distributing a single connection’s packets across hundreds of parallel network paths instead of forcing everything through one fixed route. A typical deployment splits an 800Gb/s network interface into eight independent 100Gb/s logical “planes.” Instead of sending all traffic down one highway lane, MRC uses 100+ lanes simultaneously.
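
As a mental model, the spraying logic looks something like the toy sketch below. The plane split matches the 800Gb/s example above, but the round-robin policy is illustrative; in MRC the path selection runs in NIC hardware, not Python:

```python
from itertools import cycle

# Toy model of packet spraying (illustrative, not MRC's real data path):
# one 800 Gb/s NIC exposed as eight independent 100 Gb/s logical planes,
# with a single connection's packets dealt out across them in turn.
NIC_GBPS = 800
NUM_PLANES = 8
print(f"{NIC_GBPS} Gb/s NIC -> {NUM_PLANES} planes of {NIC_GBPS // NUM_PLANES} Gb/s each")

planes = cycle(range(NUM_PLANES))
for seq in range(10):                        # first packets of one connection
    print(f"packet {seq:2d} -> plane {next(planes)}")
```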

When a path fails or becomes congested, MRC’s NIC hardware detects it in microseconds and stops selecting that path, redistributing traffic across the remaining paths. No network-wide reconfiguration. No BGP convergence. Just immediate, hardware-accelerated failover that GPUs never notice.
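
Extending the toy model shows the failover behavior. The microsecond detection itself happens in NIC hardware; this sketch (class name and selection policy invented for illustration) only models the selection-set update that follows:

```python
class ResilientSprayer:
    """Toy failover model: stop picking a path the moment it looks bad."""

    def __init__(self, num_paths: int):
        self.alive = set(range(num_paths))

    def mark_failed(self, path: int) -> None:
        # Called when hardware flags a path as failed or congested.
        # No reconvergence, no signaling: the path simply stops being chosen.
        self.alive.discard(path)

    def pick_path(self, seq: int) -> int:
        # Spread packets over surviving paths only; traffic from a dead
        # path redistributes immediately across everything that remains.
        paths = sorted(self.alive)
        return paths[seq % len(paths)]

s = ResilientSprayer(num_paths=8)
print([s.pick_path(i) for i in range(8)])    # all eight planes in use
s.mark_failed(3)                             # a link on plane 3 goes down
print([s.pick_path(i) for i in range(8)])    # plane 3 silently skipped
```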

MRC also extends RoCEv2 (RDMA over Converged Ethernet) with multipath capabilities and uses SRv6 source routing, where the sender specifies the path in each packet header. This eliminates complex routing protocol interactions and enables simpler network topologies: OpenAI can build 100,000+ GPU clusters with just two tiers of Ethernet switches instead of three or four, reducing power consumption and operational complexity.
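
Source routing is the key trick: the path lives in the packet, not in the switches. The Scapy sketch below builds an SRv6 packet purely to show the mechanism; the segment addresses are documentation-range placeholders, and MRC's actual wire format is defined by the OCP spec, not reproduced here:

```python
# Minimal SRv6 illustration with Scapy (assumes scapy is installed).
from scapy.layers.inet import UDP
from scapy.layers.inet6 import IPv6, IPv6ExtHdrSegmentRouting

hops = ["2001:db8::a", "2001:db8::b", "2001:db8::c"]  # leaf -> spine -> leaf
pkt = (
    IPv6(dst=hops[0])                        # destination = current segment
    / IPv6ExtHdrSegmentRouting(
        addresses=hops[::-1],                # the SRH lists segments in reverse
        segleft=len(hops) - 1,               # segments still left to visit
    )
    / UDP(dport=4791)                        # 4791 = RoCEv2's UDP port
)
pkt.show()                                   # inspect the headers without sending
```

Because each switch simply forwards toward the address in the header, the fabric keeps no per-flow routing state, which is what makes the flatter two-tier topology practical.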

Already Training GPT-5.5 in Production

MRC isn’t a research project or beta technology. OpenAI confirmed it’s “already deployed across all of our largest NVIDIA GB200 supercomputers used to train frontier models,” including the Oracle Cloud Infrastructure site in Abilene, Texas, and Microsoft’s Fairwater supercomputers. It’s been used to train GPT-5.5 and other models at production scale.

This deployment validates MRC for the hardest real-world workload: multi-week training runs on 100,000+ synchronized GPUs where even a single failure can cascade into cluster-wide disruption. That OpenAI is betting its frontier model training on MRC, and open-sourcing it via the Open Compute Project, signals both confidence in the technology and a strategic shift toward industry standardization.

Why Six Competitors Agreed to Collaborate

This collaboration is unprecedented: AMD vs. NVIDIA vs. Intel in the chip wars, Broadcom vs. NVIDIA in networking, and OpenAI alongside Microsoft despite partial ownership and competing products. All six companies published coordinated announcements on May 5-6 and released MRC as an open standard through the Open Compute Project.

The answer is simple: no single vendor can solve network failures at frontier scale alone, and open standardization benefits everyone more than proprietary fragmentation. Networking has become the next frontier in the training efficiency race—after optimizing model architectures, GPUs, and software frameworks, the bottleneck has shifted to interconnect.

Dell’Oro Group reports that Ethernet now leads AI back-end network deployments, driven by cost advantages and multi-vendor ecosystems. MRC accelerates this shift by proving Ethernet can scale to 100,000+ GPUs with InfiniBand-equivalent performance while avoiding single-vendor lock-in.

MRC vs. InfiniBand: The Quiet Displacement

MRC’s foundation on RoCE instead of InfiniBand represents a strategic bet: 40-55% lower total cost of ownership over three years while delivering equivalent performance. Industry research from Juniper shows CapEx savings of 56% and OpEx savings of 55%—driven by lower hardware costs, reduced power consumption, and eliminating the need for specialized InfiniBand expertise.

Meta’s AI infrastructure team validated this years ago. Their 24,000-GPU training clusters run on RoCE, and their conclusion was blunt: “Both RoCE and InfiniBand provide equivalent performance when properly tuned for AI training. Not just ‘good enough’—equivalent performance.” MRC’s multipath enhancements make Ethernet resilient enough for frontier AI, while InfiniBand’s performance advantage has all but disappeared.

By open-sourcing MRC via OCP, these six companies are standardizing on Ethernet for gigascale AI, potentially shifting billions in infrastructure spending away from proprietary InfiniBand. This isn’t just a technical advancement; it’s a market realignment.

Key Takeaways

  • MRC solves a billion-dollar problem: Network failures at 100,000-GPU scale waste massive compute time; MRC routes around them in microseconds instead of seconds.
  • Unprecedented collaboration: Six competitors (OpenAI, AMD, Broadcom, Intel, Microsoft, NVIDIA) released an open standard because no single vendor could solve this alone.
  • Production-validated: Already training GPT-5.5 and frontier models on OpenAI’s largest clusters—not vaporware.
  • Open standard: Released via Open Compute Project, enabling multi-vendor adoption and preventing lock-in.
  • Cost advantage: RoCE + MRC delivers 40-55% TCO savings vs. InfiniBand while maintaining equivalent performance.
  • Strategic shift: Networking is now the frontier of AI training efficiency, and Ethernet is winning.