Data Center Solution

RoCE v2 Deployment Guide

Build predictable lossless Ethernet for RDMA and GPU workloads.

Overview

RoCEv2 carries RDMA over routed Ethernet and is widely used in AI, HPC, and high-performance storage networks. It gives applications low-latency remote memory access while preserving the operational flexibility of Ethernet.

RoCEv2 is also unforgiving. Packet loss, congestion, incorrect priority mapping, or inconsistent lossless policy can quickly reduce job performance. An xSONiC RoCEv2 design should therefore treat QoS, routing, telemetry, and failure testing as one system.

RoCEv2 Design Stack

Layer       | Design Decision                                         | Validation
Application | Identify RDMA workloads and traffic phases.             | Test all-reduce, storage, and failure behavior.
Host NIC    | Configure priorities, DSCP, ECN, and PFC expectations.  | Confirm NIC counters and congestion response.
xSONiC leaf | Map traffic classes to queues and lossless priorities.  | Check PFC, ECN, ETS, and DCBX state.
Fabric      | Provide predictable ECMP, bandwidth, and convergence.   | Validate path diversity and failure recovery.
Telemetry   | Monitor queue depth, drops, pause frames, and latency.  | Correlate network state with workload timing.

Traffic Classification

RoCEv2 deployments should keep RDMA traffic classification explicit. Operators usually define a DSCP or priority value for RDMA, map that value to a queue, and then apply PFC only where lossless behavior is required.

Application traffic
      |
      v
Host NIC marks DSCP / priority
      |
      v
xSONiC leaf maps priority to queue
      |
      v
PFC / ECN / ETS policy applies to selected traffic class
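
On stock SONiC, this classification lives in config_db.json. The sketch below (Python, generating a config fragment) is a minimal illustration assuming the standard SONiC QoS tables, DSCP_TO_TC_MAP, TC_TO_QUEUE_MAP, and PORT_QOS_MAP, which xSONiC is assumed to inherit; the DSCP values, map name, and port are placeholders to replace with your own plan.

import json

# Minimal config_db.json fragment for explicit RoCEv2 classification.
# DSCP 26 -> TC 3 (RDMA) and DSCP 48 -> TC 6 (CNP) are illustrative
# values; they must match what the host NICs actually mark.
qos_fragment = {
    "DSCP_TO_TC_MAP": {
        "ROCE_MAP": {"26": "3", "48": "6"}   # unmapped DSCP falls back to TC 0
    },
    "TC_TO_QUEUE_MAP": {
        "ROCE_MAP": {"0": "0", "3": "3", "6": "6"}
    },
    "PORT_QOS_MAP": {
        "Ethernet0": {                       # hypothetical server-facing port
            "dscp_to_tc_map": "ROCE_MAP",
            "tc_to_queue_map": "ROCE_MAP",
            "pfc_enable": "3"                # PFC only on the lossless priority
        }
    },
}

with open("roce_qos_fragment.json", "w") as f:
    json.dump(qos_fragment, f, indent=2)

On stock SONiC such a fragment is merged with sudo config load; confirm the exact schema and workflow on your xSONiC release before applying it.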

Congestion Controls

Mechanism | Purpose                                                  | Design Warning
PFC       | Prevents packet loss for selected priorities.            | Overuse can spread pause behavior across the fabric.
ECN       | Marks congestion before queue overflow.                  | Thresholds must match buffer and workload behavior.
CNP       | Tells senders to reduce rate after congestion feedback.  | Feedback path delay matters during incast.
Fast CNP  | Shortens sender notification in supported designs.       | Requires flow awareness and careful validation.
ETS       | Balances bandwidth among traffic classes.                | Avoid starving non-RDMA operational traffic.
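
ECN thresholds follow the same configuration pattern. Below is a minimal sketch assuming the stock SONiC WRED_PROFILE and QUEUE tables; the byte thresholds and marking probability are placeholders, to be replaced with values derived from buffer and workload testing as the checklist further down requires.

import json

# Illustrative ECN profile for the lossless queue. Marking too early
# wastes bandwidth; marking too late lets PFC engage first.
ecn_fragment = {
    "WRED_PROFILE": {
        "ROCE_ECN": {
            "wred_green_enable": "true",
            "ecn": "ecn_all",                  # mark ECT packets instead of dropping
            "green_min_threshold": "1048576",  # bytes; marking begins here
            "green_max_threshold": "2097152",  # bytes; marking probability peaks here
            "green_drop_probability": "5",     # peak marking probability, percent
        }
    },
    "QUEUE": {
        "Ethernet0|3": {"wred_profile": "ROCE_ECN"}  # bind to the lossless queue
    },
}

with open("roce_ecn_fragment.json", "w") as f:
    json.dump(ecn_fragment, f, indent=2)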

Reference Fabric Pattern

GPU / storage servers
        |
        v
100G / 200G / 400G / 800G xSONiC leaves
        |
        v
High-radix xSONiC spines
        |
        v
Peer pods, storage, or backend GPU domains

Large AI clusters often separate backend GPU traffic, storage traffic, and frontend service traffic. Smaller deployments may share layers, but the QoS policy should still keep traffic classes explicit.
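
A quick arithmetic check helps when layers are shared: leaf oversubscription is server-facing bandwidth divided by uplink bandwidth, and backend GPU fabrics typically target 1:1. The port counts and speeds in this sketch are hypothetical.

# Leaf oversubscription: server-facing bandwidth over uplink bandwidth.
# Backend GPU fabrics usually aim for 1.0 (non-blocking); storage or
# frontend layers often tolerate more. Values below are hypothetical.
def leaf_oversubscription(down_ports: int, down_gbps: int,
                          up_ports: int, up_gbps: int) -> float:
    return (down_ports * down_gbps) / (up_ports * up_gbps)

# 32 x 400G down to servers, 16 x 800G up to spines -> 1.0, non-blocking
print(leaf_oversubscription(32, 400, 16, 800))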

Deployment Checklist

  1. Define RDMA traffic classes and DSCP or priority mappings.
  2. Align host NIC, xSONiC switch, and application expectations.
  3. Enable PFC only for the priorities that require lossless behavior.
  4. Set ECN thresholds using queue and workload testing, not default values alone.
  5. Validate DCBX state on server-facing links where negotiation is used.
  6. Run incast, all-reduce, storage read/write, and link-failure tests.
  7. Monitor queue depth, drops, ECN marks, CNP rate, and PFC pause frames together (a minimal logging sketch follows this list).
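
A minimal sketch of step 7, assuming the stock SONiC CLI commands show queue counters and show pfc counters are available on xSONiC: it snapshots both with UTC timestamps so the log can later be lined up against workload step times.

import datetime
import subprocess
import time

# Periodically snapshot queue and PFC counters with UTC timestamps.
# Output format and exact command names vary by release; verify them
# on your platform before trusting the log.
COMMANDS = ["show queue counters", "show pfc counters"]

def snapshot_loop(log_path="roce_counters.log", interval_s=10):
    with open(log_path, "a") as log:
        while True:
            stamp = datetime.datetime.now(datetime.timezone.utc).isoformat()
            for cmd in COMMANDS:
                result = subprocess.run(cmd.split(), capture_output=True, text=True)
                log.write(f"=== {stamp} | {cmd} ===\n{result.stdout}\n")
            log.flush()
            time.sleep(interval_s)

if __name__ == "__main__":
    snapshot_loop()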

Common Failure Modes

Symptom                                            | Likely Cause                               | Investigation
Training step time spikes                          | Queue buildup or path imbalance.           | Inspect queue delay, ECN marks, and ECMP path distribution.
PFC pause storms                                   | Lossless class under sustained pressure.   | Check thresholds, traffic mix, and priority mapping.
RDMA retransmission                                | Lossless policy not consistent end to end. | Compare host NIC and switch QoS state.
Good average utilization but poor job performance  | Microbursts or tail latency.               | Use INT/IPT-style telemetry and workload phase correlation.
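
For the pause-storm row in particular, a simple rate check on pause-frame counters catches sustained pressure early. The sketch below is illustrative only: the rate limit and window are placeholders to tune against a known-good baseline, and the per-port samples come from whatever telemetry pipeline you already run.

from collections import defaultdict

# Toy pause-storm detector: flag a port when its RX PFC pause counter
# grows faster than a threshold for several consecutive intervals.
PAUSE_RATE_LIMIT = 1000   # pause frames/sec, hypothetical
SUSTAINED_SAMPLES = 3     # consecutive hot intervals before alerting

_last = {}
_hot = defaultdict(int)

def check(port: str, pause_total: int, interval_s: float) -> bool:
    prev = _last.get(port)
    _last[port] = pause_total
    if prev is None:
        return False
    rate = (pause_total - prev) / interval_s
    _hot[port] = _hot[port] + 1 if rate > PAUSE_RATE_LIMIT else 0
    return _hot[port] >= SUSTAINED_SAMPLES  # True -> investigate this port

# Example: three hot 10-second samples on Ethernet8 trigger an alert.
for total in (0, 20000, 40000, 60000):
    print(check("Ethernet8", total, 10.0))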

xSONiC Platform Fit

xSONiC 400G and 800G switches fit backend AI fabrics where east-west bandwidth dominates. 100G and 200G systems are useful for storage, frontend, and migration layers where operational stability matters as much as raw port speed.

Related Products

Products commonly paired with this solution.

Use these related platforms as a starting point for sizing, comparison, and follow-up discussion.

XS-DC-64X800-AI-G1

Data Center AI

64-port 800G AI fabric switch for large-scale GPU clusters, HPC backbones, and ultra-high-throughput data center networks.

51.2 Tbps switching capacity
42,000 Mpps forwarding rate

XS-DC-64X200-LS-G1

Data Center AI

64-port 200G leaf/spine switch for high-bandwidth storage, compute, and scale-out data center fabrics.

12.8 Tbps switching capacity
19,040 Mpps forwarding rate

Next Step

Move from the RoCE v2 Deployment Guide into implementation.

Use the related products below to continue comparing platforms, or open a conversation if you need help mapping the solution to your environment.