Why GPU Backend Fabric Design Is Different from General Data Center Networking
A GPU backend fabric carries bulk synchronous traffic between GPUs during collective operations such as AllReduce, AllGather, and ReduceScatter. Unlike general east-west data center traffic that tolerates occasional packet drops and retransmissions, GPU backend traffic is latency- and loss-sensitive. A single packet drop on a 400G link carrying an NCCL AllReduce can stall an entire training iteration, wasting GPU cycles across the cluster.
The consequence is that GPU backend fabrics require lossless Ethernet transport. RoCE v2 (RDMA over Converged Ethernet version 2) is the dominant protocol for this role because it delivers near-InfiniBand latency over commodity Ethernet and, being encapsulated in UDP/IP, can be routed across a Layer 3 fabric. However, RoCE v2 relies on Priority Flow Control (PFC) and Explicit Congestion Notification (ECN) to achieve lossless behavior, and misconfiguring either can cause head-of-line blocking, PFC storms, or silent throughput collapse.
For Australian data center teams, this means the fabric design must account for:
- Consistent PFC and ECN configuration across every switch in the fabric
- Strict traffic isolation between RoCE v2 and non-RDMA workloads
- Congestion notification propagation at wire speed
- Telemetry that can detect microbursts and incast patterns before they impact training jobs
This guide walks through the decision criteria, configuration checklists, and operational practices needed to deploy a reliable 400G or 800G GPU backend fabric using xSONIC switches running Enterprise SONiC.
400G vs 800G: Decision Criteria for GPU Backend Fabric
The choice between 400G and 800G per link depends on GPU density, training job scale, and budget. Use the following decision table as a starting framework.
| Factor | 400G (QSFP-DD / OSFP) | 800G (OSFP-XD / QSFP-DD800) |
|---|---|---|
| Per-port bandwidth | 400 Gbps | 800 Gbps |
| Typical radix per switch | 32-64 ports | 16-32 ports |
| GPU cluster size supported per switch tier | 64-256 GPUs | 128-512 GPUs |
| Latency target | Sub-5 microsecond hop | Sub-5 microsecond hop (comparable) |
| Optic cost per port | Lower, mature supply | Higher, limited supply in AU market |
| Cable reach (SR) | Up to 100m multimode | Up to 50-100m multimode |
| Cable reach (DR) | Up to 2km single-mode | Up to 2km single-mode |
| Maturity for SONiC | Broad platform support | Emerging platform support |
| Best fit | Clusters of 64-512 GPUs | Clusters of 256-2048+ GPUs |
Key decision points:
- GPU density per rack. If each rack hosts 8 or 16 GPUs (for example, two or four 4-GPU servers, or one or two 8-GPU servers), and you need non-blocking bandwidth at the ToR, 400G per uplink is typically sufficient for clusters up to approximately 256 GPUs. For larger clusters or higher per-GPU bandwidth, 800G spine links reduce oversubscription.
- Training job scale. Large language model (LLM) training with hundreds of GPUs running tensor parallelism across nodes needs the lowest possible tail latency. 800G spine links carry the same bandwidth over fewer links, which reduces the number of points where oversubscription can occur.
- Optic and cable availability in Australia. As of mid-2026, 800G optics (OSFP-XD) have more limited supply channels in the Australian market compared to 400G QSFP-DD and OSFP modules. Verify lead times with local distributors before committing to 800G at the spine tier.
- Future-proofing. If the cluster will scale beyond 512 GPUs within 18 months, deploying 800G-capable spine switches now (even if initially populated with 400G optics) avoids a forklift upgrade later.
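The GPU-count thresholds in these decision points can be turned into a rough first-pass filter. The sketch below is illustrative only: the function name `recommend_spine_speed` and its return strings are hypothetical, and it deliberately ignores budget and optic lead times, which still have to be weighed by hand.

```python
def recommend_spine_speed(gpus_now: int, gpus_in_18_months: int) -> str:
    """Rule-of-thumb spine link speed based on the thresholds in this guide."""
    if gpus_now <= 0 or gpus_in_18_months < gpus_now:
        raise ValueError("GPU counts must be positive and non-decreasing")
    # Growth beyond 512 GPUs within 18 months favours 800G-capable spines,
    # even if they are initially populated with 400G optics.
    if gpus_in_18_months > 512:
        return "800G-capable spine"
    # Up to roughly 256 GPUs, 400G uplinks are typically sufficient.
    if gpus_now <= 256:
        return "400G"
    # In between, the thresholds alone do not decide the question.
    return "400G or 800G; compare optic lead times and budget"


if __name__ == "__main__":
    print(recommend_spine_speed(128, 200))
    print(recommend_spine_speed(256, 1024))
```

Treat the result as a starting point for the conversation with procurement, not as a sizing decision.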
Spine-Leaf Topology for GPU Backend Fabric
GPU backend fabrics use a two-tier or three-tier spine-leaf topology. The design goal is non-blocking bandwidth between any two GPU endpoints.
Two-tier spine-leaf (recommended for up to ~512 GPUs):
- Leaf switches connect to GPU servers via 400G downlinks
- Each leaf switch has 400G or 800G uplinks to every spine switch
- Spine count is determined by the uplink oversubscription ratio
Three-tier (Clos) fabric (for clusters exceeding ~512 GPUs):
- An additional super-spine tier provides east-west bandwidth scaling
- Super-spine switches use 800G links to spine switches
- Leaf-to-spine links remain 400G or 800G, as in the two-tier design
Design rules for lossless operation:
- Maintain a 1:1 (non-blocking) oversubscription ratio for leaf-to-spine uplinks whenever budget allows. If oversubscription is necessary, keep it at 2:1 or lower.
- Every switch in the fabric path must run the same PFC and ECN configuration. Inconsistent settings create lossy pockets inside a lossless fabric.
- Use a dedicated VLAN or VRF for RoCE v2 traffic. Do not mix general-purpose TCP traffic on the same priority queue.
- Assign a dedicated traffic class and priority for RoCE v2 using 802.1p priority values (typically priority 3 or 4 for RoCE data, and priority 6 for RoCE control/CNP).
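The relationship between spine count and oversubscription can be worked out with simple arithmetic. The following is a minimal sketch that assumes each leaf has exactly one uplink to each spine; the function names are hypothetical and the port counts in the example are illustrative, not a recommendation.

```python
import math


def oversubscription_ratio(downlinks: int, downlink_gbps: int,
                           uplinks: int, uplink_gbps: int) -> float:
    """Ratio of server-facing bandwidth to spine-facing bandwidth on a leaf."""
    return (downlinks * downlink_gbps) / (uplinks * uplink_gbps)


def spines_needed(downlinks: int, downlink_gbps: int,
                  uplink_gbps: int, max_ratio: float = 1.0) -> int:
    """Spine count for a target ratio, assuming one uplink per spine."""
    required_uplink_gbps = downlinks * downlink_gbps / max_ratio
    return math.ceil(required_uplink_gbps / uplink_gbps)


if __name__ == "__main__":
    # A leaf with 16 x 400G downlinks to GPU servers.
    print(spines_needed(16, 400, 400))                 # 1:1 with 400G uplinks
    print(spines_needed(16, 400, 800))                 # 1:1 with 800G uplinks
    print(spines_needed(16, 400, 400, max_ratio=2.0))  # 2:1 with 400G uplinks
    print(oversubscription_ratio(16, 400, 8, 400))
```

For this example leaf, a non-blocking design needs 16 spines at 400G or 8 at 800G, while accepting the 2:1 ceiling halves the 400G spine count to 8.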
Recommended VLAN and priority mapping:
| Traffic Type | 802.1p Priority | PFC Enabled | ECN Marking | Notes |
|---|---|---|---|---|
| RoCE v2 Data (RDMA Write/Read) | 3 | Yes | Yes | Bulk data transfers |
| RoCE v2 Control (CNP) | 6 | No | No | Congestion Notification Packets |
| General TCP/IP | 0 | No | No | Management, storage, metadata |
| Storage (NFS/iSCSI) | 1 | Optional | Optional | If co-located |
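Because inconsistent settings create lossy pockets, it can help to diff each switch against the mapping table. The sketch below assumes the per-switch settings have already been collected into plain dictionaries by some other means; the data layout, the switch names, and the `find_deviations` helper are hypothetical and not part of any SONiC tooling. The optional storage class is left out of the reference on purpose.

```python
# Reference mapping from the table: priority -> (pfc_enabled, ecn_enabled).
REFERENCE = {
    3: (True, True),    # RoCE v2 data
    6: (False, False),  # RoCE v2 control (CNP)
    0: (False, False),  # general TCP/IP
}


def find_deviations(fabric: dict) -> list:
    """Return (switch, priority, expected, actual) for every mismatch."""
    deviations = []
    for switch in sorted(fabric):
        observed = fabric[switch]
        for priority, expected in REFERENCE.items():
            actual = observed.get(priority)  # None if the priority is missing
            if actual != expected:
                deviations.append((switch, priority, expected, actual))
    return deviations


if __name__ == "__main__":
    fabric = {
        "leaf1": dict(REFERENCE),
        # spine1 has PFC disabled on priority 3: a lossy pocket.
        "spine1": {3: (False, True), 6: (False, False), 0: (False, False)},
    }
    for deviation in find_deviations(fabric):
        print(deviation)
```

An empty result means every switch matches the reference for the three mandatory classes.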
RoCE v2 Configuration Checklist
Use this checklist when configuring RoCE v2 on xSONIC switches for GPU backend fabric. Every item must be verified on every switch in the data path.
Global Settings:
- DSCP-based QoS mode enabled (RoCE v2 uses DSCP 26 for data, DSCP 48 for CNP by default; confirm with NIC vendor)
- ECN enabled on the RoCE v2 traffic class (WRED thresholds configured)
- PFC enabled on the RoCE v2 traffic class (priority 3 for data traffic)
- PFC watchdog enabled to detect and recover from PFC storms
- Strict priority queuing for CNP traffic (priority 6) to ensure congestion notifications are never dropped
Per-Interface Settings:
- PFC enabled on every interface in the fabric path (leaf-to-spine, spine-to-leaf)
- ECN marking thresholds set per interface based on buffer depth
- PFC deadlock detection and recovery enabled on all fabric ports
- MTU set to 9216 (jumbo frames) on all RoCE v2 VLAN interfaces
- Link-level flow control (IEEE 802.3x PAUSE) disabled on fabric ports (PFC replaces PAUSE)
VLAN and Routing:
- Dedicated VLAN for RoCE v2 traffic created
- RoCE v2 VLAN trunked on all leaf-to-server and leaf-to-spine links
- Subnet routing or VRF isolation configured if RoCE v2 traffic must not leak to management networks
- ARP and ND inspection if required by security policy (but verify no interference with RDMA connection setup)
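Part of the checklist above lends itself to an automated audit. The sketch below checks a handful of the per-interface items against a hypothetical `FabricPort` record; the field names are assumptions for illustration, and populating them from a real switch is left to whatever inventory or telemetry tooling is already in place.

```python
from dataclasses import dataclass


@dataclass
class FabricPort:
    name: str
    mtu: int
    pfc_priorities: frozenset   # priorities with PFC enabled
    pfc_watchdog: bool
    link_pause_enabled: bool    # IEEE 802.3x PAUSE


def audit_port(port: FabricPort) -> list:
    """Return a list of checklist violations for one fabric port."""
    findings = []
    if port.mtu != 9216:
        findings.append(f"{port.name}: MTU is {port.mtu}, expected 9216")
    if 3 not in port.pfc_priorities:
        findings.append(f"{port.name}: PFC not enabled on priority 3")
    if not port.pfc_watchdog:
        findings.append(f"{port.name}: PFC watchdog disabled")
    if port.link_pause_enabled:
        findings.append(f"{port.name}: 802.3x PAUSE enabled, should be off")
    return findings


if __name__ == "__main__":
    ports = [
        FabricPort("Ethernet0", 9216, frozenset({3}), True, False),
        FabricPort("Ethernet4", 1500, frozenset(), False, True),
    ]
    for port in ports:
        for finding in audit_port(port):
            print(finding)
```

Running the audit on every port of every switch in the data path mirrors the requirement that each checklist item be verified fabric-wide.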
Verification Commands:
- show priority-flow-control: verify PFC is active on the correct priorities and interfaces
- show ecn: verify ECN marking is enabled and thresholds are correct
- show interfaces counters: confirm no PAUSE frame counters are incrementing on fabric links
- show queue counters: verify RoCE traffic class is transmitting and no drops on priority 3 or 6
DCBX and Priority Flow Control Configuration
Data Center Bridging Capability Exchange (DCBX) is the LLDP-based protocol, standardized in IEEE 802.1Qaz, that advertises and negotiates lossless Ethernet parameters between link peers. DCBX carries TLVs for PFC, ETS (Enhanced Transmission Selection), and application priority.
Why DCBX matters for GPU backend fabric:
- PFC negotiation ensures that both endpoints of a link agree on which priorities use pause frames. If one end enables PFC on priority 3 and the other does not, pause frames are ignored on that link and RoCE traffic is silently dropped under congestion.
- ETS ensures bandwidth allocation across traffic classes. In a GPU backend fabric, RoCE v2 data traffic should receive a guaranteed minimum bandwidth share.
- Application Protocol TLVs can advertise the priority assigned to RoCE v2 (identified by UDP destination port 4791) to peer devices, enabling automatic configuration where supported.
Recommended DCBX configuration:
- Enable DCBX on all leaf and spine interfaces that carry RoCE v2 traffic
- Configure PFC mode as “auto” to allow DCBX negotiation with peer NICs and switches
- Enable PFC on priority 3 (RoCE data). CNP on priority 6 does not require PFC; it relies on strict priority queuing instead
- Configure ETS to allocate a minimum of 50-70% of link bandwidth to the RoCE v2 traffic class, with the remainder shared across other classes
- Verify DCBX neighbor status after link-up to confirm both sides have negotiated identical parameters
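The final verification step amounts to comparing what each side of a link ended up with. The sketch below assumes the negotiated state of both peers has already been read into a hypothetical `DcbxState` record; the field names, the `roce_tc` parameter, and the 50% floor taken from the ETS recommendation above are illustrative assumptions, not an xSONIC API.

```python
from dataclasses import dataclass


@dataclass
class DcbxState:
    pfc_priorities: frozenset  # priorities with PFC operational
    ets_percent: dict          # traffic class -> bandwidth share in percent


def compare_peers(local: DcbxState, peer: DcbxState, roce_tc: int = 3) -> list:
    """Return a list of problems found between two ends of one link."""
    problems = []
    if local.pfc_priorities != peer.pfc_priorities:
        problems.append(
            f"asymmetric PFC: local={sorted(local.pfc_priorities)} "
            f"peer={sorted(peer.pfc_priorities)}"
        )
    if local.ets_percent != peer.ets_percent:
        problems.append("ETS allocation differs between peers")
    share = local.ets_percent.get(roce_tc, 0)
    if share < 50:
        problems.append(f"RoCE class {roce_tc} has only {share}% of bandwidth")
    return problems


if __name__ == "__main__":
    switch = DcbxState(frozenset({3}), {3: 60, 0: 40})
    nic = DcbxState(frozenset(), {3: 60, 0: 40})  # PFC never negotiated
    for problem in compare_peers(switch, nic):
        print(problem)
```

An asymmetric result here corresponds to the "Asymmetric PFC state" symptom, where DCBX is typically not enabled on one end.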
Common DCBX failure modes:
| Symptom | Likely Cause | Resolution |
|---|---|---|
| RoCE transfers stall after link flap | PFC negotiation failed after flap | Verify DCBX neighbor status; check for firmware mismatch |
| Head-of-line blocking on spine | PFC enabled on wrong priority | Verify priority mapping matches leaf config |
| PFC storm on single link | Persistent congestion with no CNP response | Check NIC firmware for ECN/CNP support; verify Fast CNP on switch |
| Asymmetric PFC state | DCBX not enabled on one end | Enable DCBX on both switch and NIC side |
See the xSONIC DCBX Technology solution guide for detailed configuration procedures.