GPU Backend Fabric Design with RoCE v2

Why GPU Backend Fabric Design Is Different from General Data Center Networking

A GPU backend fabric carries bulk synchronous traffic between GPUs during collective operations such as AllReduce, AllGather, and ReduceScatter. Unlike general east-west data center traffic that tolerates occasional packet drops and retransmissions, GPU backend traffic is latency- and loss-sensitive. A single packet drop on a 400G link carrying an NCCL AllReduce can stall an entire training iteration, wasting GPU cycles across the cluster.

The consequence is that GPU backend fabrics require lossless Ethernet transport. RoCE v2 (RDMA over Converged Ethernet version 2), which carries RDMA over routable UDP/IP, is the dominant protocol for this role because it delivers near-InfiniBand latency over commodity Ethernet. However, RoCE v2 relies on Priority Flow Control (PFC) to prevent drops and on Explicit Congestion Notification (ECN) to slow senders before buffers fill, and misconfiguring either can cause head-of-line blocking, PFC storms, or silent throughput collapse.

For Australian data center teams, this means the fabric design must account for:

Consistent PFC and ECN configuration across every switch in the fabric
Strict traffic isolation between RoCE v2 and non-RDMA workloads
Congestion notification propagation at wire speed
Telemetry that can detect microbursts and incast patterns before they impact training jobs

This guide walks through the decision criteria, configuration checklists, and operational practices needed to deploy a reliable 400G or 800G GPU backend fabric using xSONIC switches running Enterprise SONiC.

400G vs 800G: Decision Criteria for GPU Backend Fabric

The choice between 400G and 800G per link depends on GPU density, training job scale, and budget. Use the following decision table as a starting framework.

Factor	400G (QSFP-DD / OSFP)	800G (OSFP-XD / QSFP-DD800)
Per-port bandwidth	400 Gbps	800 Gbps
Typical radix per switch	32-64 ports	16-32 ports
GPU cluster size supported per switch tier	64-256 GPUs	128-512 GPUs
Latency target	Sub-5 microsecond hop	Sub-5 microsecond hop (comparable)
Optic cost per port	Lower, mature supply	Higher, limited supply in AU market
Cable reach (SR)	Up to 100m multimode	Up to 50-100m multimode
Cable reach (DR)	Up to 2km single-mode	Up to 2km single-mode
Maturity for SONiC	Broad platform support	Emerging platform support
Best fit	Clusters of 64-512 GPUs	Clusters of 256-2048+ GPUs

Key decision points:

GPU density per rack. If each rack hosts 8 to 32 GPUs (for example, 2 or 4 GPU servers with 4 or 8 GPUs each), and you need non-blocking bandwidth at the ToR, 400G per uplink is typically sufficient for clusters up to approximately 256 GPUs. For larger clusters or higher per-GPU bandwidth, 800G spine links reduce oversubscription.
Training job scale. Large language model (LLM) training with hundreds of GPUs running tensor parallelism across nodes needs the lowest possible tail latency. 800G spine links reduce the number of oversubscription points.
Optic and cable availability in Australia. As of mid-2026, 800G optics (OSFP-XD) have more limited supply channels in the Australian market compared to 400G QSFP-DD and OSFP modules. Verify lead times with local distributors before committing to 800G at the spine tier.
Future-proofing. If the cluster will scale beyond 512 GPUs within 18 months, deploying 800G-capable spine switches now (even if initially populated with 400G optics) avoids a forklift upgrade later.

Spine-Leaf Topology for GPU Backend Fabric

GPU backend fabrics use a two-tier or three-tier spine-leaf topology. The design goal is non-blocking bandwidth between any two GPU endpoints.

Two-tier spine-leaf (recommended for up to ~512 GPUs):

Leaf switches connect to GPU servers via 400G downlinks
Each leaf switch has 400G or 800G uplinks to every spine switch
Spine count is determined by the uplink oversubscription ratio
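
The relationship between uplink count and oversubscription can be sketched in a few lines of Python. This is a minimal illustration rather than a sizing tool: the function names and the 32-port leaf are hypothetical, and it assumes one uplink from each leaf to each spine, so the number of uplinks equals the spine count.

```python
from math import ceil


def oversubscription_ratio(downlinks, downlink_gbps, uplinks, uplink_gbps):
    """Leaf downlink capacity divided by uplink capacity (1.0 means non-blocking)."""
    return (downlinks * downlink_gbps) / (uplinks * uplink_gbps)


def uplinks_needed(downlinks, downlink_gbps, uplink_gbps, target_ratio=1.0):
    """Smallest uplink count that keeps the ratio at or below target_ratio."""
    return ceil((downlinks * downlink_gbps) / (uplink_gbps * target_ratio))


# Hypothetical leaf with 32 x 400G downlinks to GPU servers.
print(oversubscription_ratio(32, 400, 16, 800))        # 1.0 (non-blocking)
print(uplinks_needed(32, 400, 800))                    # 16 uplinks at 800G for 1:1
print(uplinks_needed(32, 400, 400, target_ratio=2.0))  # 16 uplinks at 400G for 2:1
```

Under the one-uplink-per-spine assumption, the second call suggests 16 spines for a non-blocking design with 800G uplinks; bundling several uplinks to each spine reduces the spine count proportionally.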

Three-tier (Clos) fabric (for clusters exceeding ~512 GPUs):

An additional super-spine tier provides east-west bandwidth scaling
Super-spine switches use 800G links to spine switches
Leaf-to-spine and spine-to-super-spine links are all 400G or 800G

Design rules for lossless operation:

Maintain a 1:1 (non-blocking) oversubscription ratio for leaf-to-spine uplinks whenever budget allows. If oversubscription is necessary, keep it at 2:1 or lower.
Every switch in the fabric path must run the same PFC and ECN configuration. Inconsistent settings create lossy pockets inside a lossless fabric.
Use a dedicated VLAN or VRF for RoCE v2 traffic. Do not mix general-purpose TCP traffic on the same priority queue.
Assign a dedicated traffic class and priority for RoCE v2 using 802.1p priority values (typically priority 3 or 4 for RoCE data, and priority 6 for RoCE control/CNP).

Recommended VLAN and priority mapping:

Traffic Type	802.1p Priority	PFC Enabled	ECN Marking	Notes
RoCE v2 Data (RDMA Write/Read)	3	Yes	Yes	Bulk data transfers
RoCE v2 Control (CNP)	6	No	No	Congestion Notification Packets
General TCP/IP	0	No	No	Management, storage, metadata
Storage (NFS/iSCSI)	1	Optional	Optional	If co-located
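
Because every switch must carry the same mapping, it can help to diff each switch's settings against a single reference. The sketch below assumes the per-switch settings have already been collected into plain dictionaries; the switch names, the dictionary layout, and the find_mismatches helper are hypothetical and do not reflect any xSONIC data format. The optional storage class is left out of the reference.

```python
# Reference policy from the mapping table: 802.1p priority -> expected settings.
REFERENCE = {
    3: {"pfc": True, "ecn": True},    # RoCE v2 data
    6: {"pfc": False, "ecn": False},  # RoCE v2 control (CNP)
    0: {"pfc": False, "ecn": False},  # general TCP/IP
}


def find_mismatches(switch_configs, reference=REFERENCE):
    """Return (switch, priority, field, expected, actual) for each deviation."""
    mismatches = []
    for switch in sorted(switch_configs):
        config = switch_configs[switch]
        for priority, expected in reference.items():
            actual = config.get(priority, {})
            for field, want in expected.items():
                got = actual.get(field)
                if got != want:
                    mismatches.append((switch, priority, field, want, got))
    return mismatches


# Hypothetical inventory: spine1 is missing PFC on priority 3.
good = {p: dict(s) for p, s in REFERENCE.items()}
bad = {p: dict(s) for p, s in REFERENCE.items()}
bad[3]["pfc"] = False
fabric = {"leaf1": good, "leaf2": good, "spine1": bad}

for row in find_mismatches(fabric):
    print(row)  # ('spine1', 3, 'pfc', True, False)
```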

RoCE v2 Configuration Checklist

Use this checklist when configuring RoCE v2 on xSONIC switches for GPU backend fabric. Every item must be verified on every switch in the data path.

Global Settings:

DSCP-based QoS mode enabled (RoCE v2 uses DSCP 26 for data, DSCP 48 for CNP by default; confirm with NIC vendor)
ECN enabled on the RoCE v2 traffic class (WRED thresholds configured)
PFC enabled on the RoCE v2 traffic class (priority 3 for data traffic)
PFC watchdog enabled to detect and recover from PFC storms
Strict priority queuing for CNP traffic (priority 6) to ensure congestion notifications are never dropped

Per-Interface Settings:

PFC enabled on every interface in the fabric path (leaf-to-spine, spine-to-leaf)
ECN marking thresholds set per interface based on buffer depth
PFC deadlock detection and recovery enabled on all fabric ports
MTU set to 9216 (jumbo frames) on all RoCE v2 VLAN interfaces
Link-level flow control (IEEE 802.3x PAUSE) disabled on fabric ports (PFC replaces PAUSE)
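
To make the ECN threshold item concrete, the following sketch models the usual WRED-style marking curve: no marking below a minimum queue depth, a linear ramp up to a maximum probability, and marking of every packet once the maximum threshold is reached. The threshold values are placeholders chosen for easy arithmetic, not tuning recommendations, and the function name is hypothetical.

```python
def ecn_mark_probability(queue_bytes, min_th, max_th, max_prob):
    """Probability that an ECN-capable packet is marked at this queue depth."""
    if not 0 <= min_th < max_th:
        raise ValueError("thresholds must satisfy 0 <= min_th < max_th")
    if not 0.0 <= max_prob <= 1.0:
        raise ValueError("max_prob must be between 0 and 1")
    if queue_bytes < min_th:
        return 0.0  # queue is shallow: never mark
    if queue_bytes >= max_th:
        return 1.0  # queue is past the upper threshold: mark everything
    # Linear ramp from 0 at min_th up to max_prob just below max_th.
    return max_prob * (queue_bytes - min_th) / (max_th - min_th)


# Placeholder thresholds: 100 KB minimum, 500 KB maximum, 25% peak probability.
for depth in (50_000, 300_000, 600_000):
    print(depth, ecn_mark_probability(depth, 100_000, 500_000, 0.25))
```

Deeper buffers allow a wider gap between the two thresholds, which is why the checklist ties the values to buffer depth.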

VLAN and Routing:

Dedicated VLAN for RoCE v2 traffic created
RoCE v2 VLAN trunked on all leaf-to-server and leaf-to-spine links
Subnet routing or VRF isolation configured if RoCE v2 traffic must not leak to management networks
ARP and ND inspection if required by security policy (but verify no interference with RDMA connection setup)

Verification Commands:

show priority-flow-control - verify PFC is active on the correct priorities and interfaces
show ecn - verify ECN marking is enabled and thresholds are correct
show interfaces counters - confirm no PAUSE frame counters are incrementing on fabric links
show queue counters - verify RoCE traffic class is transmitting and no drops on priority 3 or 6
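
Counter checks are only meaningful as deltas between two readings. The sketch below compares two snapshots and reports the watched counters that increased. It assumes the counters have already been parsed into dictionaries; the counter names and interface names are hypothetical, since the exact output format of the commands above depends on the platform and release.

```python
def incrementing_counters(before, after, watched):
    """Return {(interface, counter): delta} for watched counters that grew."""
    deltas = {}
    for interface, counters in after.items():
        previous = before.get(interface, {})
        for name in watched:
            delta = counters.get(name, 0) - previous.get(name, 0)
            if delta > 0:
                deltas[(interface, name)] = delta
    return deltas


# Hypothetical counter names and snapshots taken some seconds apart.
WATCHED = ("rx_pause", "tx_pause", "prio3_drops", "prio6_drops")
snapshot_t0 = {
    "Ethernet0": {"rx_pause": 0, "prio3_drops": 0},
    "Ethernet4": {"rx_pause": 10, "prio3_drops": 0},
}
snapshot_t1 = {
    "Ethernet0": {"rx_pause": 0, "prio3_drops": 0},
    "Ethernet4": {"rx_pause": 25, "prio3_drops": 2},
}

print(incrementing_counters(snapshot_t0, snapshot_t1, WATCHED))
# {('Ethernet4', 'rx_pause'): 15, ('Ethernet4', 'prio3_drops'): 2}
```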

DCBX and Priority Flow Control Configuration

Data Center Bridging Capability Exchange (DCBX) is the protocol, standardized in IEEE 802.1Qaz and carried in LLDP TLVs, that advertises and negotiates lossless Ethernet parameters between link peers. DCBX exchanges PFC, ETS (Enhanced Transmission Selection), and Application Protocol TLVs.

Why DCBX matters for GPU backend fabric:

PFC negotiation ensures that both endpoints of a link agree on which priorities use pause frames. If one end enables PFC on priority 3 and the other does not, the end without PFC ignores pause frames and keeps transmitting, its peer's buffers overflow, and the fabric silently drops traffic.
ETS ensures bandwidth allocation across traffic classes. In a GPU backend fabric, RoCE v2 data traffic should receive a guaranteed minimum bandwidth share.
Application Protocol TLVs can advertise the priority assigned to RoCE v2 traffic (identified by UDP destination port 4791) to peer devices, enabling automatic configuration where supported.

Recommended DCBX configuration:

Enable DCBX on all leaf and spine interfaces that carry RoCE v2 traffic
Configure PFC mode as “auto” to allow DCBX negotiation with peer NICs and switches
Enable PFC on priority 3 (RoCE data). CNP on priority 6 does not itself require PFC, which is consistent with the priority mapping table earlier in this guide
Configure ETS to allocate a minimum of 50-70% of link bandwidth to the RoCE v2 traffic class, with the remainder shared across other classes
Verify DCBX neighbor status after link-up to confirm both sides have negotiated identical parameters
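
The ETS recommendation can be checked mechanically before it is pushed to the fabric. This is a minimal sketch under two assumptions: the bandwidth shares across ETS classes are expected to sum to 100%, and the RoCE v2 class must meet a configurable floor (50% here, the low end of the range above). The class names and the validate_ets helper are hypothetical.

```python
def validate_ets(allocation, roce_class="roce", min_roce_pct=50):
    """Return a list of problems found in an ETS bandwidth allocation (percent)."""
    problems = []
    total = sum(allocation.values())
    if total != 100:
        problems.append(f"shares sum to {total}%, expected 100%")
    roce_share = allocation.get(roce_class, 0)
    if roce_share < min_roce_pct:
        problems.append(
            f"{roce_class} has {roce_share}%, below the {min_roce_pct}% floor"
        )
    return problems


# Hypothetical allocations keyed by traffic class name.
print(validate_ets({"roce": 60, "storage": 15, "default": 25}))  # []
print(validate_ets({"roce": 40, "storage": 15, "default": 40}))
# ['shares sum to 95%, expected 100%', 'roce has 40%, below the 50% floor']
```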

Common DCBX failure modes:

Symptom	Likely Cause	Resolution
RoCE transfers stall after link flap	PFC negotiation failed after flap	Verify DCBX neighbor status; check for firmware mismatch
Head-of-line blocking on spine	PFC enabled on wrong priority	Verify priority mapping matches leaf config
PFC storm on single link	Persistent congestion with no CNP response	Check NIC firmware for ECN/CNP support; verify Fast CNP on switch
Asymmetric PFC state	DCBX not enabled on one end	Enable DCBX on both switch and NIC side

See the xSONIC DCBX Technology solution guide for detailed configuration procedures.

