The InfiniBand Monopoly on AI Networking Is Cracking
For years, the default answer to ‘what fabric connects your GPU cluster?’ was InfiniBand. Training large language models and running inference at scale demanded the low latency and lossless transport that InfiniBand delivered, and few alternatives existed. That assumption is now under pressure.
The pressure is not coming from a niche experiment. NVIDIA, the vendor that sells the dominant AI interconnect, now also ships its own Ethernet platform, Spectrum-X, with AI-specific features. The market signal is clear: Ethernet has closed enough of the performance gap to be taken seriously for GPU backend fabrics.
For Australian AI teams evaluating their next infrastructure investment, this shift creates a meaningful decision point that did not exist two or three years ago.
What Changed: RDMA over Ethernet Becomes Production-Ready
The core technical barrier to Ethernet in AI clusters was never raw bandwidth — 400GbE and 800GbE optics have been available for some time. The barrier was transport behavior. AI collective operations (all-reduce, all-to-all) are latency-sensitive and intolerant of packet loss. InfiniBand’s lossless fabric and native RDMA primitives made it the safe choice.
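To put rough numbers on why transport behavior matters more than headline bandwidth, the sketch below estimates the cost of one all-reduce. It assumes a ring all-reduce, which the article does not specify, and the function names, link speed, and per-step latency are illustrative assumptions rather than measurements.

```python
def ring_allreduce_bytes_per_gpu(message_bytes: float, num_gpus: int) -> float:
    """Bytes each GPU sends in a ring all-reduce (reduce-scatter + all-gather)."""
    if num_gpus < 2:
        return 0.0
    # Each of the two phases moves (N - 1) / N of the buffer per GPU.
    return 2 * (num_gpus - 1) / num_gpus * message_bytes


def ring_allreduce_seconds(
    message_bytes: float,
    num_gpus: int,
    link_gbps: float,
    per_step_latency_s: float,
) -> float:
    """Simplified cost model: bandwidth term plus a per-step latency term."""
    if num_gpus < 2:
        return 0.0
    link_bytes_per_s = link_gbps * 1e9 / 8
    bandwidth_term = ring_allreduce_bytes_per_gpu(message_bytes, num_gpus) / link_bytes_per_s
    # A ring all-reduce takes 2 * (N - 1) sequential steps, so latency adds up.
    latency_term = 2 * (num_gpus - 1) * per_step_latency_s
    return bandwidth_term + latency_term
```

For 8 GPUs and a 1 GB buffer, each GPU sends 1.75 GB. Because every step waits on the slowest link, a single dropped or delayed packet stretches the whole operation, which is why lossless, low-latency transport was the deciding factor.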
Ethernet has caught up through a stack of standards and silicon innovations:
- RoCE v2 (RDMA over Converged Ethernet v2): Carries RDMA operations over standard UDP/IP Ethernet, enabling GPU-to-GPU memory transfers without CPU involvement.
- DCBX (Data Center Bridging Capability Exchange): Negotiates lossless Ethernet parameters between switches and endpoints so that Priority Flow Control (PFC) can pause individual traffic classes before receive buffers overflow.
- Fast CNP (Congestion Notification Packet): Delivers rapid congestion feedback to RDMA senders so they can slow down before queues build, reducing tail latency under load.
- INT (In-band Network Telemetry): Inserts per-hop latency and queue depth metadata into packet headers, giving AI fabric operators real-time visibility into fabric health.
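The congestion feedback loop in the list above can be sketched as a sender that cuts its rate when a CNP arrives and recovers when none arrive. The article does not name a congestion control algorithm; this sketch assumes a DCQCN-style scheme and collapses its separate timers into single-step updates. The class and function names and the gain `g` are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class SenderState:
    current_rate_gbps: float  # rate the sender is transmitting at now
    target_rate_gbps: float   # rate to recover toward after a cut
    alpha: float = 1.0        # running estimate of congestion severity


def on_cnp(state: SenderState, g: float = 1 / 256) -> None:
    """React to a congestion notification: remember the old rate, then cut."""
    state.target_rate_gbps = state.current_rate_gbps
    state.current_rate_gbps *= 1 - state.alpha / 2
    state.alpha = (1 - g) * state.alpha + g


def on_quiet_period(state: SenderState, g: float = 1 / 256) -> None:
    """No CNP seen for a while: decay alpha and move halfway back to target."""
    state.alpha = (1 - g) * state.alpha
    state.current_rate_gbps = (state.target_rate_gbps + state.current_rate_gbps) / 2
```

Starting at 400 Gb/s with `alpha` at 1.0, one CNP halves the rate to 200 Gb/s and one quiet period brings it back to 300 Gb/s. The faster the CNP reaches the sender, the smaller the queue that builds before the cut takes effect.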
These capabilities are not theoretical. SONiC (Software for Open Networking in the Cloud), the Linux Foundation open-source network operating system, supports BGP and RDMA and has been ‘production-hardened in the data centers of some of the largest cloud service providers,’ according to the SONiC Foundation. SONiC runs on switches from multiple vendors and ASICs, decoupling hardware from software and accelerating the pace of innovation.
For AI teams, this means Ethernet fabrics can now deliver the lossless, low-latency transport that GPU collective operations demand — without requiring a proprietary interconnect stack.
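Lossless behavior has a buffer cost that is easy to estimate. The sketch below is a simplified model of how much headroom a port needs to absorb data still in flight after it sends a PFC pause. The formula, the 5 ns/m propagation figure, and the parameter names are assumptions for illustration; real switch ASICs and vendor guides use their own, more detailed calculations.

```python
def pfc_headroom_bytes(
    link_gbps: float,
    cable_m: float,
    mtu_bytes: int = 4096,
    response_time_s: float = 0.0,
    prop_ns_per_m: float = 5.0,
) -> float:
    """Rough per-priority headroom needed to avoid drops after a PFC pause.

    Assumed model: data keeps arriving for one cable round trip plus the
    sender's response time, plus one maximum-size frame already being sent
    in each direction.
    """
    bytes_per_s = link_gbps * 1e9 / 8
    round_trip_s = 2 * cable_m * prop_ns_per_m * 1e-9
    in_flight = bytes_per_s * (round_trip_s + response_time_s)
    return in_flight + 2 * mtu_bytes
```

Under these assumptions, a 400GbE link over 100 m of cable needs roughly 58 KB of headroom per lossless priority. Headroom scales with both link speed and cable length, so it is worth checking against switch buffer sizes when planning 800GbE ports.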
Why the Economics Favor Ethernet for Growing AI Clusters
The technical argument for Ethernet in AI clusters is necessary but not sufficient. The economic argument may be what tips the balance for many organizations.
Ethernet offers several structural advantages that matter at scale:
| Factor | InfiniBand | Ethernet |
|---|---|---|
| Switch vendor options | Limited (primarily NVIDIA Mellanox) | Broad (NVIDIA Spectrum, Broadcom, Marvell, and white-box OEMs) |
| NOS flexibility | Proprietary | SONiC, Cumulus Linux, vendor NOS |
| Transceiver ecosystem | Specialized | Massive (SFP28, QSFP28, QSFP-DD, OSFP from many vendors) |
| Operations team familiarity | Low (dedicated IB teams often required) | High (most data center teams already operate Ethernet) |
| Supply chain risk | Concentrated | Distributed |
| Campus-to-DC fabric convergence | Not possible | Single fabric standard |
The supply chain argument carries particular weight in Australia, where networking hardware lead times and vendor concentration create real operational risk. An Ethernet-based AI fabric built on SONiC and open switching hardware gives procurement teams multiple sourcing paths.
The operations argument is equally important. Running an AI cluster on a different fabric from the rest of the data center means training separate teams, maintaining separate monitoring, and following separate troubleshooting workflows. Converging on Ethernet removes that operational tax.
What This Means for Australian AI Infrastructure Buyers
Australia’s AI infrastructure market faces a specific set of constraints: distance from major hardware distribution hubs, a concentrated pool of networking talent, and growing demand for sovereign AI capabilities across government, research, and enterprise sectors.
Ethernet-based AI fabrics address several of these constraints directly:
Wider procurement options. Ethernet switches based on Broadcom or Marvell silicon are available from multiple OEM and ODM vendors. This reduces dependence on a single supplier and can shorten lead times for Australian buyers who may face longer shipping windows than US or European counterparts.
SONiC as a portable NOS. Because SONiC runs on switches from multiple vendors and ASICs, Australian teams can evaluate hardware on performance and price rather than being locked to a single NOS ecosystem. The SONiC community includes both large-scale cloud operators and enterprise adopters, and the project’s container-based architecture supports incremental upgrades without full fabric requalification.
Skills alignment. Most Australian data center teams operate Ethernet fabrics today. Extending Ethernet into the AI backend means the existing team can own the fabric without hiring or training for a specialized interconnect protocol.
Optics ecosystem. Ethernet’s massive transceiver ecosystem means Australian buyers can source 100G, 400G, and 800G optics from multiple vendors, including those with Australian stock and warranty support. This matters when a single failed optic can stall an AI training job worth significant compute hours.
The decision is not one-size-fits-all. InfiniBand still has meaningful advantages in raw latency at extreme scale, and organizations with existing InfiniBand expertise and installed base may not benefit from a mid-cycle switch. But for new greenfield AI clusters — particularly those under 1,000 GPUs — Ethernet deserves serious evaluation.
The SONiC Factor: Why Open Networking Matters for AI Fabrics
The role of SONiC in this Ethernet-for-AI shift deserves separate attention because it changes the buyer calculus significantly.
SONiC is a free and open-source network operating system based on Linux, hosted under the Linux Foundation. It runs on switches from multiple vendors and multiple ASIC families. Its key architectural properties are directly relevant to AI fabric operations:
- Container-based modular design: Each network function runs in its own Docker container, providing fault isolation, easier debugging, and simplified upgrades. For AI clusters where fabric downtime can waste hours of GPU compute, this matters.
- Standard Linux interfaces: SONiC uses standard Linux tools and interfaces, meaning automation, monitoring, and observability tooling that teams already use for servers can extend to the network fabric.
- Production-hardened RDMA: SONiC supports BGP and RDMA, and has been deployed in production at large-scale cloud providers. This is not a lab experiment.
- Multi-vendor hardware support: SONiC’s Switch Abstraction Interface (SAI) decouples the NOS from the underlying ASIC, enabling hardware selection based on performance, price, and availability rather than software lock-in.
For xSONIC’s data center AI switch product direction, SONiC alignment is foundational. xSONIC data center AI switches running Enterprise SONiC can deliver the RoCE v2, DCBX, Fast CNP, and INT telemetry capabilities that Ethernet-based AI fabrics require, while giving Australian buyers the NOS portability and multi-vendor flexibility that proprietary stacks cannot match.
The SONiC community’s growth is also worth noting. At the time of writing, the SONiC GitHub repository shows over 2,800 stars and nearly 1,300 forks, with an active contributor base. The ecosystem includes contributions from major chip vendors and cloud operators, which helps SONiC’s AI-relevant features (RDMA, telemetry, congestion management) keep pace with the market.
The Competitive Landscape: Not Just Ethernet vs InfiniBand
The networking choice for AI clusters is often framed as a binary: Ethernet or InfiniBand. The reality is more nuanced and creates opportunity for buyers willing to evaluate open alternatives.
The Ethernet switching market for AI now includes multiple serious contenders:
- NVIDIA Spectrum-X: A vertically integrated Ethernet platform built on Spectrum-4 and Spectrum-6 silicon, with its own software stack (Cumulus Linux, Pure SONiC, NetQ, DSX Air). The Spectrum-6 SN6000 series introduces co-packaged optics for improved resiliency and power efficiency.
- Broadcom switching silicon: Powers switches from multiple OEMs and forms the basis of many white-box and bare-metal switch platforms. Broadcom’s Ethernet connectivity and switching products are widely deployed in cloud and enterprise data centers.
- Open networking via SONiC on bare-metal hardware: A disaggregated approach where buyers select switching hardware, SONiC as the NOS, and optics independently. This maximizes supply chain flexibility and eliminates single-vendor NOS lock-in.
For Australian buyers, the competitive landscape matters because it determines negotiating leverage, support options, and long-term switching costs. A fabric built on open SONiC running on bare-metal switches gives procurement teams the ability to qualify multiple hardware vendors against the same software baseline — a powerful position in a market where hardware availability can be unpredictable.
NVIDIA’s own inclusion of ‘Pure SONiC’ as a supported NOS option on its Spectrum switches validates the SONiC ecosystem even for buyers who prefer NVIDIA hardware. The question for Australian AI teams is not whether SONiC is production-ready — it clearly is — but which combination of hardware, NOS, and optics best fits their specific cluster scale, performance targets, and operational model.
What to Watch Next
Several developments will shape the Ethernet-for-AI story in the coming quarters:
- 800GbE adoption in AI clusters: Both NVIDIA Spectrum-6 and competing silicon now support 800GbE ports. Watch for real-world deployment reports from cloud providers and research institutions.
- SONiC RDMA feature maturation: As the SONiC community adds more sophisticated congestion management, telemetry, and RDMA optimization features, the gap between Ethernet and InfiniBand for AI workloads continues to narrow.
- Optics pricing at 400G and 800G: Ethernet’s advantage partly depends on optics cost and availability. Monitor 400G QSFP-DD and 800G OSFP pricing trends, especially for Australian stock.
- Co-packaged optics for AI switches: NVIDIA’s Spectrum-6 introduces co-packaged silicon photonics, which could shift the power efficiency and reliability equation further in Ethernet’s favor. This technology is worth watching but remains early-stage for most buyers.
The bottom line for Australian AI teams: Ethernet is no longer the ‘good enough’ alternative to InfiniBand for GPU clusters. It is a legitimate primary fabric choice, backed by production-hardened open-source software, a competitive switching silicon market, and an economics story that favors organizations building new AI infrastructure. Ethernet should be on every AI team’s shortlist.
Sources Reviewed
- SONiC Foundation: https://sonicfoundation.dev/
- SONiC GitHub: https://github.com/sonic-net/SONiC
- Azure SONiC Documentation: https://azure.github.io/SONiC
- Open Compute Networking: https://www.opencompute.org/projects/networking
- Broadcom Ethernet Switching: https://www.broadcom.com/products/ethernet-connectivity/switching
- Marvell Switching: https://www.marvell.com/products/switching.html
- NVIDIA Ethernet Switching: https://www.nvidia.com/en-us/networking/ethernet-switching