DCBX, PFC, ECN, and Fast CNP: The Congestion Control Stack Reshaping AI Ethernet Fabrics

A technical analysis of how DCBX, PFC, ECN, and Fast CNP work together to deliver lossless, low-jitter RoCE v2 performance in AI Ethernet fabrics, and what open networking means for Australian AI infrastructure buyers.

By xSONiC Team · SONiC, open networking, data center, AI fabric, Ethernet, automation

Why AI Ethernet Fabrics Need More Than Bandwidth

The networking industry spent the last decade optimising for throughput. AI workloads are forcing a different conversation. When a thousand-GPU training job runs an all-reduce collective over RoCE v2, the fabric must deliver not just high bandwidth but predictable, lossless delivery. A single packet drop can stall a gradient synchronisation across the entire cluster, wasting GPU cycles that cost real money.

This is the context in which DCBX, PFC, ECN, and Fast CNP have moved from niche data centre bridging specifications to the operational baseline for AI Ethernet fabrics. Each protocol addresses a different layer of the congestion problem, and together they form a stack that allows RoCE v2 to function reliably at scale.

For Australian enterprises investing in private AI infrastructure - from hyperscaler-style GPU clusters to on-premises inference deployments - understanding this stack is no longer optional. The congestion control choices made at fabric design time directly affect training throughput, inference latency, and GPU utilisation.

DCBX: The Negotiation Layer Most Teams Skip

Data Center Bridging Capabilities Exchange Protocol (DCBX) is defined in IEEE 802.1Qaz and carried in LLDP (IEEE 802.1AB) TLVs. It allows adjacent switches and endpoints to advertise and negotiate DCB parameters. It is the handshake that ensures both sides of a link agree on Priority Flow Control settings, ETS (Enhanced Transmission Selection) traffic classes, and application-to-priority mappings before any production traffic flows. ECN marking thresholds are not part of this exchange and are configured on the switch separately.

In practice, DCBX is the protocol that prevents configuration drift across a fabric. Without it, a misaligned PFC configuration between a leaf switch and a server NIC can cause head-of-line blocking, uncontrolled pause frame storms, or silent drops that surface as mysterious training job failures.
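
The kind of mismatch DCBX is meant to surface can be sketched as a comparison of what each end of a link advertises. This is a minimal illustration, not a DCBX implementation: the `DcbConfig` type, its fields, and `find_mismatches` are hypothetical names, and real DCBX TLVs carry more state (willing bits, capability counts, application priorities).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DcbConfig:
    """Hypothetical, simplified view of the DCB settings one link end advertises."""
    pfc_enabled: frozenset  # priorities (0-7) on which PFC is enabled
    ets_percent: tuple      # bandwidth share per traffic class, summing to 100


def find_mismatches(local: DcbConfig, peer: DcbConfig) -> list:
    """Return human-readable descriptions of settings the two ends disagree on."""
    problems = []
    only_local = local.pfc_enabled - peer.pfc_enabled
    only_peer = peer.pfc_enabled - local.pfc_enabled
    if only_local:
        problems.append(f"PFC enabled only locally on priorities {sorted(only_local)}")
    if only_peer:
        problems.append(f"PFC enabled only on peer on priorities {sorted(only_peer)}")
    if local.ets_percent != peer.ets_percent:
        problems.append(
            f"ETS shares differ: local {local.ets_percent}, peer {peer.ets_percent}"
        )
    return problems


# Illustrative values only: a leaf switch and a server NIC that disagree on priority 4.
switch = DcbConfig(pfc_enabled=frozenset({3}), ets_percent=(50, 50))
nic = DcbConfig(pfc_enabled=frozenset({3, 4}), ets_percent=(50, 50))

for problem in find_mismatches(switch, nic):
    print(problem)
```

An empty result means the two ends agree on the settings modelled here; anything else is a candidate for the pause and drop symptoms described above.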

SONiC - the open-source network operating system that underpins many cloud provider and open networking deployments - implements DCBX as part of its data centre bridging subsystem. The SONiC Foundation, a Linux Foundation project, describes SONiC as offering a full suite of network functionality including RDMA, production-hardened in the data centres of some of the largest cloud service providers. For organisations evaluating open networking as the foundation for AI fabrics, DCBX support in the NOS is a prerequisite, not a nice-to-have.

PFC: Per-Priority Flow Control for Lossless Delivery

Priority Flow Control (IEEE 802.1Qbb) extends Ethernet flow control from an all-or-nothing mechanism to per-traffic-class granularity. A switch can send a PFC pause frame on a specific priority queue - typically the RDMA traffic class - while leaving other queues (management, storage, general compute) unaffected.
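
The per-priority granularity is visible in the frame itself. The sketch below builds a PFC frame as a MAC Control frame (EtherType 0x8808, opcode 0x0101) with a priority-enable vector and eight pause timers. The function name and the source MAC in the usage line are illustrative assumptions, and the frame check sequence is left to the hardware.

```python
import struct

PFC_DEST_MAC = bytes.fromhex("0180c2000001")  # reserved MAC Control multicast address
MAC_CONTROL_ETHERTYPE = 0x8808
PFC_OPCODE = 0x0101


def build_pfc_frame(src_mac: bytes, pause_quanta: dict) -> bytes:
    """Build a PFC frame (without FCS).

    pause_quanta maps a priority (0-7) to a pause time in quanta (0-65535);
    one quantum is 512 bit times. Priorities not listed are left untouched.
    """
    if len(src_mac) != 6:
        raise ValueError("source MAC must be 6 bytes")
    enable_vector = 0
    timers = [0] * 8
    for priority, quanta in pause_quanta.items():
        if not 0 <= priority <= 7:
            raise ValueError("priority must be 0-7")
        if not 0 <= quanta <= 0xFFFF:
            raise ValueError("pause time must fit in 16 bits")
        enable_vector |= 1 << priority  # bit n enables the timer for priority n
        timers[priority] = quanta
    payload = struct.pack("!HH8H", PFC_OPCODE, enable_vector, *timers)
    header = PFC_DEST_MAC + src_mac + struct.pack("!H", MAC_CONTROL_ETHERTYPE)
    return (header + payload).ljust(60, b"\x00")  # pad to minimum frame size


# Pause only priority 3 (a common but not universal choice for RDMA traffic).
frame = build_pfc_frame(bytes.fromhex("020000000001"), {3: 0xFFFF})
print(frame.hex())
```

Only the bit for priority 3 is set in the enable vector, so the receiver keeps transmitting on the other seven priorities.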

PFC is the mechanism that creates lossless Ethernet behaviour for RoCE v2 traffic. When the ingress buffer a switch has reserved for the RDMA priority crosses a configured threshold (often called XOFF), the switch sends a PFC pause frame upstream, telling the neighbouring device to stop transmitting on that priority. The neighbour pauses only that priority queue, keeping other traffic flowing normally, and resumes once the buffer drains.
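
The pause and resume behaviour is a simple hysteresis on buffer occupancy. The following sketch assumes two thresholds, here called `xoff` and `xon`, and a list of occupancy samples; the names and the numbers in the usage line are illustrative, not values from any particular switch.

```python
def pfc_signals(occupancy_samples, xoff, xon):
    """Return (sample_index, action) pairs for a single priority.

    A PAUSE is signalled when occupancy reaches xoff; a RESUME is signalled
    once occupancy has drained back down to xon. The gap between the two
    thresholds prevents rapid pause/resume flapping.
    """
    if xon >= xoff:
        raise ValueError("xon must be below xoff")
    paused = False
    events = []
    for index, occupancy in enumerate(occupancy_samples):
        if not paused and occupancy >= xoff:
            paused = True
            events.append((index, "PAUSE"))
        elif paused and occupancy <= xon:
            paused = False
            events.append((index, "RESUME"))
    return events


# Occupancy in arbitrary units (for example, KB) sampled over time.
print(pfc_signals([10, 50, 90, 95, 70, 40, 20], xoff=80, xon=30))
```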

The challenge with PFC is that it is a blunt instrument for congestion control. A pause applies to an entire priority on a link, not to the individual flow causing the congestion, so unrelated flows sharing that priority are held up as well (head-of-line blocking). Pause frames also propagate hop-by-hop toward the traffic source, which can spread congestion across the fabric and, in the worst case, produce PFC storms or deadlock. This is why PFC alone is insufficient for AI-scale RoCE v2 fabrics. It needs to be paired with end-to-end congestion notification.

ECN and Fast CNP: From Detection to Rapid Response

Explicit Congestion Notification (RFC 3168, extended for data centre use) adds IP-layer congestion signalling that PFC cannot provide. When a switch detects queue buildup beyond a configured threshold, it sets the two-bit ECN field in the IP header of ECN-capable transit packets to CE (Congestion Experienced). The receiving NIC detects the marking and sends a Congestion Notification Packet (CNP) back to the sender, which then reduces its transmission rate.

This is the end-to-end congestion control loop that prevents PFC from being the only line of defence. ECN provides early warning before queues fill enough to trigger PFC, allowing the RDMA NIC (RNIC) to reduce injection rate proactively.
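
A common way for switches to implement this marking is a RED-style ramp: no marking below a minimum queue depth, probabilistic marking up to a maximum depth, and marking of every packet above it. The sketch below assumes that model; `k_min`, `k_max`, and `p_max` are illustrative parameter names, and the thresholds in the usage lines are made-up numbers, not tuning advice.

```python
ECN_MASK = 0b11  # ECN occupies the two low-order bits of the IPv4 TOS byte
NOT_ECT = 0b00   # sender did not declare ECN capability
ECN_CE = 0b11    # Congestion Experienced


def ecn_mark_probability(queue_bytes, k_min, k_max, p_max):
    """Probability of CE-marking a packet at the given queue depth."""
    if k_min >= k_max:
        raise ValueError("k_min must be below k_max")
    if queue_bytes <= k_min:
        return 0.0
    if queue_bytes >= k_max:
        return 1.0
    # Linear ramp from 0 at k_min up to p_max at k_max.
    return p_max * (queue_bytes - k_min) / (k_max - k_min)


def mark_ce(tos_byte):
    """Set CE on an ECN-capable packet; leave Not-ECT packets unchanged."""
    if tos_byte & ECN_MASK == NOT_ECT:
        return tos_byte
    return tos_byte | ECN_CE


for depth in (100_000, 500_000, 900_000):
    print(depth, ecn_mark_probability(depth, k_min=200_000, k_max=800_000, p_max=0.1))
print(hex(mark_ce(0x6A)))  # DSCP 26 with ECT(0) set
```

Setting `k_min` well below the PFC threshold is what lets ECN act as the early warning: senders start slowing down while there is still buffer headroom.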

Fast CNP is the acceleration of this loop. In standard ECN-based RoCE v2 congestion control, there is latency between the switch marking a packet, the receiver generating a CNP, and the sender reacting. Fast CNP compresses this timeline by generating CNPs closer to the congestion point - either at the switch itself or through a faster receiver-side mechanism - so the sender reduces its rate sooner.

The result is tighter congestion control with less buffer consumption, which means fewer PFC pause events, less congestion spreading, and more predictable tail latency for GPU collectives. For AI training jobs that run across hundreds or thousands of GPUs, this predictability directly translates to higher effective GPU utilisation.
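
The link between loop latency and buffer consumption can be shown with back-of-envelope arithmetic: until the sender reacts, the queue grows at the rate by which arrivals exceed the port's drain rate. The function and all numbers below are hypothetical and ignore the gradual nature of real rate reduction; they only illustrate why a shorter loop means less buffering.

```python
def queue_growth_bytes(arrival_gbps, drain_gbps, loop_delay_us):
    """Bytes that accumulate in a queue before the sender reacts.

    Gb/s multiplied by microseconds gives kilobits (1e9 * 1e-6 = 1e3 bits),
    and 1000 bits / 8 = 125 bytes, hence the factor of 125.
    """
    excess_gbps = max(arrival_gbps - drain_gbps, 0)
    return excess_gbps * loop_delay_us * 125


# Two 400 Gb/s senders converging on one 400 Gb/s port (a 2:1 incast).
slow_loop = queue_growth_bytes(arrival_gbps=800, drain_gbps=400, loop_delay_us=10)
fast_loop = queue_growth_bytes(arrival_gbps=800, drain_gbps=400, loop_delay_us=4)
print(slow_loop, fast_loop)
```

Under these assumed figures, cutting the notification loop from 10 to 4 microseconds reduces the buffer absorbed by the burst by the same proportion.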

The Stack in Practice: How the Four Layers Interact

The congestion control stack for AI Ethernet fabrics operates as a layered defence, with ETS and a PFC watchdog supporting the four core protocols:

  1. DCBX negotiates parameters at link-up time so both sides of every link agree on PFC priorities, ETS bandwidth allocation, and application priority mappings. This is the foundation that prevents misconfiguration.

  2. ETS (Enhanced Transmission Selection, IEEE 802.1Qaz) allocates guaranteed bandwidth to traffic classes, ensuring RDMA traffic gets the priority it needs without starving other workloads.

  3. ECN provides end-to-end congestion notification by marking packets at switch egress ports when queue depth exceeds thresholds. This is the early warning system.

  4. Fast CNP accelerates the congestion response loop so senders reduce rate quickly, minimising buffer consumption and reducing the probability of PFC activation.

  5. PFC acts as the backstop - a last-resort pause mechanism that prevents packet loss when congestion exceeds what ECN/Fast CNP can control.

  6. PFC Watchdog (a SONiC and vendor-specific feature) monitors for PFC storms and can stop honouring pause frames on an affected queue, dropping or forwarding its traffic instead, to prevent fabric-wide head-of-line blocking.
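
The watchdog in the last step can be sketched as a small state machine: a queue that stays paused for longer than a detection interval is treated as a storm, and normal behaviour is restored once it has stayed unpaused for a restoration interval. This is a simplified model under assumed names (`PfcWatchdog`, `poll`, `detection_ms`, `restoration_ms`), not the SONiC implementation, and the timings are illustrative.

```python
class PfcWatchdog:
    """Tracks one queue and reports storm detection and restoration."""

    def __init__(self, detection_ms, restoration_ms):
        self.detection_ms = detection_ms
        self.restoration_ms = restoration_ms
        self.paused_since = None  # start of the current continuous pause
        self.clear_since = None   # start of the current pause-free period
        self.storm = False

    def poll(self, now_ms, queue_paused):
        """Feed one sample; return 'STORM_DETECTED', 'RESTORED', or None."""
        if queue_paused:
            self.clear_since = None
            if self.paused_since is None:
                self.paused_since = now_ms
            if not self.storm and now_ms - self.paused_since >= self.detection_ms:
                self.storm = True
                return "STORM_DETECTED"
        else:
            self.paused_since = None
            if self.storm:
                if self.clear_since is None:
                    self.clear_since = now_ms
                if now_ms - self.clear_since >= self.restoration_ms:
                    self.storm = False
                    self.clear_since = None
                    return "RESTORED"
        return None


watchdog = PfcWatchdog(detection_ms=200, restoration_ms=400)
samples = [(0, True), (100, True), (200, True), (300, True), (400, False), (800, False)]
for now_ms, paused in samples:
    event = watchdog.poll(now_ms, paused)
    if event:
        print(now_ms, event)
```

The detection interval is the parameter that matters for GPU collectives: it bounds how long a stuck queue can hold up traffic before the watchdog intervenes.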

When this stack is correctly tuned, PFC activations should be rare. The fabric should operate primarily on ECN and Fast CNP-driven rate control, with PFC providing insurance against edge cases like micro-bursts or link failures.

For Australian organisations designing AI fabrics, this layered model is the starting point for any network design discussion - whether the NOS is SONiC, Cumulus, or a proprietary platform.

Open Networking and SONiC: What Changes for AI Fabric Buyers

The SONiC Foundation, operating as a Linux Foundation project, describes SONiC as an open-source network operating system based on Linux that runs on switches from multiple vendors and ASICs. Its key features include multi-vendor support, container-based architecture, standard Linux interfaces, and a programmable framework.

For AI fabric buyers, the significance of SONiC is not just that it is open source - it is that the DCBX, PFC, ECN, and RDMA stack is implemented in a NOS that is not locked to a single hardware vendor. This matters for three reasons:

First, ASIC choice. Different AI fabric sizes and topologies benefit from different switch ASICs. Open networking lets the buyer choose the ASIC that fits their scale and latency requirements without being locked into a proprietary NOS.

Second, operational consistency. SONiC’s container-based architecture means the congestion control subsystem can be updated, validated, and rolled out independently of other NOS components. For teams operating GPU clusters that run continuous training workloads, this modularity reduces the blast radius of NOS changes.

Third, community velocity. The SONiC community, including major cloud providers and networking vendors, continuously improves the RDMA and congestion control stack. Features like PFC watchdog, ECN tuning, and DCBX enhancements flow into the community codebase, benefiting all adopters.

NVIDIA explicitly supports Pure SONiC alongside Cumulus Linux on its Spectrum Ethernet switch portfolio, signalling that even vendors with strong proprietary NOS offerings see SONiC as a production-grade option for AI networking.

For the Australian market, where enterprise AI adoption is accelerating but data centre footprints are smaller than those of US or other APAC hyperscalers, the ability to build AI fabrics on open networking hardware with a well-understood congestion control stack is a practical advantage. It reduces vendor lock-in risk and provides access to a broader supply chain for switch hardware.

What to Ask Before Choosing an AI Fabric Congestion Control Approach

For network architects evaluating AI Ethernet fabric designs, the following questions separate mature congestion control implementations from checkbox implementations:

  • Does the NOS support full DCBX negotiation, including PFC and ETS parameter exchange, or does it rely on static configuration?
  • What ECN marking thresholds are recommended for AI training vs. inference workloads, and can they be tuned per-queue?
  • Is Fast CNP implemented at the switch level (switch-generated CNP) or only at the RNIC level?
  • Does the PFC watchdog mechanism detect and mitigate PFC storms within a timeframe that prevents GPU collective stalls?
  • Can INT (In-band Network Telemetry) or equivalent visibility be layered on top to provide real-time congestion observability across the fabric?
  • Is the congestion control stack validated against the specific RNIC vendor and firmware version in the deployment (e.g., NVIDIA ConnectX, Broadcom P-series)?
  • What is the NOS upgrade path for congestion control features, and does the container architecture allow independent updates?

These questions apply whether the buyer is evaluating a proprietary vendor stack, an open networking platform like SONiC, or a hybrid approach. The difference is that open networking platforms make the answers verifiable - the source code is available, the community provides peer review, and the buyer is not dependent on a single vendor’s roadmap.

The Australian Angle: AI Infrastructure Investment Meets Open Networking

Australian enterprises and government agencies are investing in AI infrastructure at an accelerating pace. Data centre capacity in Sydney, Melbourne, and Brisbane is expanding to support GPU-as-a-service, private LLM inference, and AI-driven analytics workloads.

For these deployments, the congestion control stack is a critical design decision that affects total cost of ownership. A fabric that relies solely on PFC without ECN/Fast CNP tends to consume more buffer memory, push buyers toward deeper-buffered switch ASICs, and deliver less predictable performance. A fabric with a well-tuned DCBX/ECN/Fast CNP stack can operate with smaller buffers, lower tail latency, and higher GPU utilisation.

Open networking - SONiC on disaggregated switch hardware - offers Australian buyers a path to AI fabric networking that avoids single-vendor lock-in while maintaining access to the same congestion control capabilities that hyperscalers use in production. This is particularly relevant for organisations that want to run AI infrastructure on-premises or in colocation facilities rather than relying entirely on public cloud GPU instances.

The xSONiC data centre AI switch portfolio, combined with Enterprise SONiC and the DCBX/Fast CNP solution framework, positions open networking as a credible foundation for AI fabric deployments at Australian enterprise scale.

Sources Reviewed