
Why AI Teams Are Reconsidering Ethernet Fabrics for GPU Clusters: An Open Networking Angle

A source-backed editorial analysis examining the industry signals that Ethernet is gaining ground as a viable fabric for AI/ML GPU clusters, historically dominated by InfiniBand. This brief synthesizes vendor direction

By xSONiC Team · SONiC, open networking, data center, AI fabric, Ethernet, automation

What Happened: Ethernet Is Making a Credible Case for AI Fabrics

For years, InfiniBand has been the default fabric for high-performance GPU clusters used in AI training and inference. Its low latency, lossless transport, and native RDMA support made it the obvious choice when GPU-to-GPU communication speed directly determined model training time.

That consensus is now under pressure. Major networking vendors are investing heavily in Ethernet-based alternatives purpose-built for AI workloads. The most visible signal comes from NVIDIA itself: alongside its well-established Quantum InfiniBand line, NVIDIA has built out the Spectrum-X Ethernet platform, explicitly positioning it as an ‘Ethernet platform for hyperscale AI cloud networking.’ The Spectrum-6 (SN6000) family, announced for use in NVIDIA Rubin-based AI factories, pushes Ethernet switching to 102.4 Tb/s aggregate throughput per switch with 800 Gb/s port speeds and co-packaged silicon photonics. The fact that NVIDIA — the company with arguably the strongest InfiniBand franchise — is also aggressively marketing Ethernet for AI is the headline signal.
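To put those figures in context, here is a back-of-envelope sketch in Python. It derives the port count implied by the quoted 102.4 Tb/s and 800 Gb/s numbers, then sizes a two-tier leaf-spine fabric built from such switches. The non-blocking 1:1 design, the assumption that every port runs at 800 Gb/s, and the function names are illustrative assumptions, not NVIDIA specifications.

```python
# Figures quoted in the text above.
SWITCH_CAPACITY_GBPS = 102_400  # 102.4 Tb/s aggregate per switch
PORT_SPEED_GBPS = 800           # 800 Gb/s per port


def ports_per_switch(capacity_gbps: int, port_gbps: int) -> int:
    """Number of full-speed ports the aggregate capacity allows."""
    return capacity_gbps // port_gbps


def max_endpoints_two_tier(radix: int) -> int:
    """Endpoints in a non-blocking two-tier leaf-spine fabric (assumed design).

    Each leaf splits its ports evenly: half face endpoints, half face spines.
    Each spine port connects one leaf, so there are at most `radix` leaves.
    """
    downlinks_per_leaf = radix // 2
    max_leaves = radix
    return max_leaves * downlinks_per_leaf


radix = ports_per_switch(SWITCH_CAPACITY_GBPS, PORT_SPEED_GBPS)
endpoints = max_endpoints_two_tier(radix)

print(f"ports per switch: {radix}")            # 128
print(f"max 800G endpoints, two tiers: {endpoints}")  # 8192
```

Real designs often trade some of that endpoint count for oversubscription, breakout ports, or a third tier, so treat the result as an upper bound for this one topology rather than a sizing recommendation.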

Simultaneously, the SONiC (Software for Open Networking in the Cloud) ecosystem has matured. The SONiC Foundation describes it as ‘an open source network operating system based on Linux that runs on switches from multiple vendors and ASICs,’ offering ‘a full suite of network functionality, like BGP and RDMA, that has been production-hardened in the data centers of some of the largest cloud service providers.’ Its container-based architecture splits network functions into modular Docker containers, enabling independent upgrades and fault isolation. Both matter when an AI fabric must absorb iterative infrastructure changes without full-stack downtime.

Why It Matters: The InfiniBand Lock-In Problem Meets the Open Networking Argument

The reconsideration of Ethernet for GPU fabrics is not purely a technical debate. It is also a procurement and architecture strategy question.

InfiniBand delivers proven performance, but it comes with a narrower vendor ecosystem. When an organization commits to an InfiniBand fabric for its AI cluster, it is typically committing to a specific vendor’s switch ASICs, adapters, cables, and management software. The switching and cabling options are fewer, supply chain flexibility is lower, and the skills pool for InfiniBand operations is smaller than for Ethernet.

Ethernet, by contrast, offers:

  • A multi-vendor switching ecosystem (Broadcom, Marvell, and others produce Ethernet switch ASICs that SONiC can target)
  • Broader operational skills availability (Ethernet knowledge is ubiquitous in data center teams)
  • Standards-based protocols (RoCE v2 for RDMA, PFC for lossless behavior, ECN for congestion signaling, DCBX for auto-configuration)
  • Open NOS options including SONiC, which provides hardware abstraction through the Switch Abstraction Interface (SAI)

SONiC as the Open NOS Option for AI Fabric Deployments

SONiC’s relevance to the AI fabric discussion centers on its ability to run on switches from multiple hardware vendors, decoupling the network operating system from the underlying ASIC. The SONiC Foundation describes SONiC as built on the Switch Abstraction Interface (SAI), which ‘helps in accelerating hardware innovation’ by separating the software layer from silicon-specific dependencies.

For an Australian data center operator evaluating a GPU cluster fabric, SONiC offers a practical architectural argument: if Ethernet is the chosen transport, and RoCE v2 handles the RDMA requirements, then SONiC provides an open platform that avoids tying the fabric to a single vendor’s NOS. NVIDIA itself supports this model, listing ‘Pure SONiC’ alongside Cumulus Linux as a supported NOS on its Spectrum Ethernet switches.

Key SONiC capabilities relevant to AI fabrics include:

  • BGP-based underlay and overlay routing for spine-leaf AI cluster topologies
  • RDMA and RoCE support for GPU-to-GPU communication
  • Containerized modular architecture for targeted upgrades
  • SAI-based hardware abstraction for multi-vendor switch deployment
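As a rough illustration of the BGP-based spine-leaf underlay mentioned in the list above, the sketch below generates an ASN plan and the resulting eBGP peerings. It assumes a common data center pattern in which all spines share one ASN and each leaf gets its own, drawn from the 4-byte private range. The naming scheme, the `Switch` type, and both helper functions are hypothetical and are not SONiC APIs or configuration syntax.

```python
from dataclasses import dataclass

# Start of the 4-byte private ASN range (RFC 6996).
PRIVATE_ASN_BASE = 4_200_000_000


@dataclass(frozen=True)
class Switch:
    name: str
    role: str  # "spine" or "leaf"
    asn: int


def build_asn_plan(num_spines: int, num_leaves: int) -> list[Switch]:
    """All spines share one ASN; each leaf gets a unique ASN (assumed scheme)."""
    spines = [Switch(f"spine{i + 1}", "spine", PRIVATE_ASN_BASE)
              for i in range(num_spines)]
    leaves = [Switch(f"leaf{i + 1}", "leaf", PRIVATE_ASN_BASE + 1 + i)
              for i in range(num_leaves)]
    return spines + leaves


def ebgp_sessions(plan: list[Switch]) -> list[tuple[str, int, str, int]]:
    """Every leaf peers with every spine: (leaf, leaf ASN, spine, spine ASN)."""
    spines = [s for s in plan if s.role == "spine"]
    leaves = [s for s in plan if s.role == "leaf"]
    return [(leaf.name, leaf.asn, spine.name, spine.asn)
            for leaf in leaves for spine in spines]


plan = build_asn_plan(num_spines=2, num_leaves=4)
for leaf, leaf_asn, spine, spine_asn in ebgp_sessions(plan):
    print(f"{leaf} (AS{leaf_asn}) <-> {spine} (AS{spine_asn})")
```

A plan like this would normally feed a templating or automation step that renders per-switch configuration; that step is deliberately left out here because it depends on the SONiC release and tooling in use.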

However, deploying SONiC in production for an AI fabric is not a trivial undertaking. The operations team needs Linux networking expertise, familiarity with SONiC’s configuration model, and comfort with community-driven open-source support. For teams without this skill set, the flexibility benefit must be weighed against the operational ramp-up cost.

xSONIC Buyer Angle: Open Ethernet Fabrics and the Australian AI Infrastructure Moment

Australia’s AI infrastructure market is in an early but accelerating build-out phase. Organizations investing in private GPU clusters for AI workloads — whether in hyperscale colocation facilities, enterprise data centers, or university research environments — face the InfiniBand-vs-Ethernet decision at the fabric layer.

The xSONIC perspective on this decision is direct: if Ethernet with RoCE v2 and modern congestion management can serve the GPU backend fabric, then an open networking stack (SONiC on multi-vendor switch hardware) gives the buyer the widest optionality in hardware sourcing, optics selection, and operational tooling.

For an Australian buyer, this matters in several ways:

  • Supply chain resilience: multi-vendor switching reduces dependency on a single manufacturer’s lead times and pricing power
  • Optics flexibility: Ethernet’s QSFP28, QSFP-DD, and OSFP optics ecosystem is broader and more competitive than InfiniBand’s, and xSONIC’s optical transceiver range targets these standard form factors
  • Operational familiarity: Ethernet fabric operations use skills that are already present in most Australian data center teams
  • Future-proofing: Ethernet’s speed roadmap (100G, 400G, 800G, and beyond) has strong industry momentum, with silicon photonics integration as a next step for power-efficient scaling

The key evaluation question for any specific AI cluster project remains: does the workload’s latency sensitivity require InfiniBand’s tighter tail latency, or can well-configured RoCE v2 with DCBX, ECN, PFC, and fast CNP (Congestion Notification Packet) generation deliver acceptable performance? This is a benchmarking and proof-of-concept question, not a marketing question.
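For readers less familiar with how ECN and PFC interact on a RoCE v2 fabric, the following minimal sketch models one switch queue. It uses the RED-style marking curve commonly paired with RoCE v2 congestion control: no marking below a minimum threshold, linearly increasing probability up to a maximum threshold, and marking every packet beyond it. PFC acts as a backstop at a higher threshold. All threshold values and function names are hypothetical and chosen only for illustration; real values depend on the ASIC, buffer size, and link speed, which is exactly what a proof of concept has to tune.

```python
def ecn_mark_probability(queue_kb: float, kmin_kb: float,
                         kmax_kb: float, pmax: float) -> float:
    """RED-style ECN marking probability for a given queue depth."""
    if not 0 <= kmin_kb < kmax_kb:
        raise ValueError("require 0 <= kmin_kb < kmax_kb")
    if not 0.0 <= pmax <= 1.0:
        raise ValueError("pmax must be between 0 and 1")
    if queue_kb <= kmin_kb:
        return 0.0          # queue is shallow: never mark
    if queue_kb >= kmax_kb:
        return 1.0          # queue is deep: mark every packet
    # Linear ramp from 0 at kmin up to pmax just below kmax.
    return pmax * (queue_kb - kmin_kb) / (kmax_kb - kmin_kb)


def pfc_should_pause(queue_kb: float, xoff_kb: float) -> bool:
    """PFC backstop: pause the upstream sender once the queue hits XOFF."""
    return queue_kb >= xoff_kb


# Hypothetical thresholds. XOFF sits above kmax so that ECN-driven
# rate reduction has a chance to act before PFC pauses the link.
KMIN_KB, KMAX_KB, PMAX, XOFF_KB = 100, 400, 0.2, 800

for depth in (50, 250, 500, 900):
    p = ecn_mark_probability(depth, KMIN_KB, KMAX_KB, PMAX)
    paused = pfc_should_pause(depth, XOFF_KB)
    print(f"queue={depth:>4} KB  ecn_prob={p:.2f}  pfc_pause={paused}")
```

The ordering of the thresholds is the design point worth noticing: if PFC triggers before ECN has slowed the senders, pause frames can spread congestion across the fabric, which is one source of the tail latency the benchmarking needs to measure.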

What to Watch: Signals That the Ethernet-for-AI Trend Is Real

Editorial candidates covering this topic should track the following signals as evidence develops:

  1. SONiC adoption stories for AI fabric deployments. The SONiC Foundation references production use in ‘some of the largest cloud service providers’ but does not name specific AI cluster deployments. Named case studies with deployment scale and workload type would strengthen the editorial thesis significantly.

  2. Australian market signals: local hyperscaler announcements, university HPC/AI cluster procurement decisions, or colocation provider fabric choices that reference Ethernet-based AI infrastructure.

  3. Open networking vendor ecosystem expansion: additional SONiC-compatible switch platforms for AI workloads, beyond NVIDIA’s Spectrum line, that provide Australian buyers with genuine multi-vendor switching options.

  4. Telemetry and observability maturity: whether INT (In-band Network Telemetry) and IPTPath telemetry capabilities in SONiC-based Ethernet fabrics can provide the AI fabric visibility that operations teams need to diagnose congestion and tail latency issues.

This editorial angle is a strong candidate for an xSONIC news analysis article, but it should not be published until at least two or three of these signals have independent source backing.

Sources Reviewed