Why AI Data Centers Need More Than SNMP and sFlow
Traditional network monitoring tools were built for a world where traffic patterns were predictable and milliseconds did not matter. SNMP polling at 60-second intervals. sFlow sampling at 1-in-1024 packets. These approaches served enterprise campus networks and north-south data center traffic well enough for two decades.
AI training clusters break that model completely.
When a distributed GPU training job runs across hundreds of accelerators, network congestion at any single hop can stall an entire collective operation. A microburst lasting 50 microseconds on a leaf-spine link can cause RDMA queue backpressure that propagates across the fabric, freezing GPU synchronization and wasting expensive compute cycles. Traditional polling intervals miss these events entirely. By the time SNMP reports elevated interface counters, the training job has already recovered or failed.
This is the observability gap that In-band Network Telemetry (INT) and path telemetry were designed to close. For Australian enterprises building private AI infrastructure on open SONiC-based fabrics, understanding these technologies is not optional — it is the difference between a GPU cluster that delivers predictable training throughput and one that silently degrades under load.
What Is In-band Network Telemetry (INT)?
INT is a network telemetry approach where switches embed operational metadata directly into data packets as those packets traverse the network. Rather than relying on separate polling or sampling mechanisms, INT turns every monitored packet into its own monitoring report.
The original INT specification was developed through the P4.org consortium and the Open Networking Foundation. It defines a framework where:
- A source node inserts an INT header into the packet at the network edge.
- Each transit switch along the path adds its own metadata, containing the local switch information requested by the INT header, to the metadata stack carried in the packet.
- The destination node (or a sink node) extracts the INT data and sends it to a telemetry collector for analysis.
The metadata captured at each hop can include:
- Switch ID and ingress/egress port identifiers
- Hop latency (the time the packet spent at that switch)
- Queue occupancy at the moment of transit
- Egress timestamp
- Link utilization indicators
- Buffer congestion state
This gives network operators a per-packet, hop-by-hop view of exactly what happened to traffic as it crossed the fabric. For AI workloads that are highly sensitive to tail latency and congestion events, this level of detail is critical.
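The source, transit, and sink roles described above can be sketched as a small data model. This is a minimal illustration, not the INT wire format: the class names, field names, and units are assumptions chosen to mirror the metadata list above.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class HopMetadata:
    """One switch's contribution to the metadata stack (illustrative fields)."""
    switch_id: int
    ingress_port: int
    egress_port: int
    hop_latency_ns: int       # time the packet spent inside this switch
    queue_occupancy: int      # queue depth observed for this packet
    egress_timestamp_ns: int


@dataclass
class IntPacket:
    """A data packet that carries a metadata stack alongside its payload."""
    payload: bytes
    hops: List[HopMetadata] = field(default_factory=list)


def transit(packet: IntPacket, hop: HopMetadata) -> IntPacket:
    """Transit role: add this switch's metadata to the packet."""
    packet.hops.append(hop)
    return packet


def sink(packet: IntPacket) -> Tuple[bytes, List[HopMetadata]]:
    """Sink role: strip the metadata and hand it back separately for export."""
    report = list(packet.hops)
    packet.hops.clear()
    return packet.payload, report


def slowest_hop(report: List[HopMetadata]) -> HopMetadata:
    """Return the hop that held the packet longest."""
    return max(report, key=lambda hop: hop.hop_latency_ns)
```

A collector built along these lines would call slowest_hop on each exported report to see which switch contributed the most delay to that packet, while the payload continues to its destination unchanged.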
How Path Telemetry Extends the Visibility Model
Path telemetry builds on the INT concept but focuses on end-to-end path characteristics rather than per-hop metadata alone. Where INT answers the question “what happened at each switch?”, path telemetry answers “what was the total experience of this flow from source to destination?”
Path telemetry implementations typically measure:
- End-to-end latency between source and destination
- Path identification (which spine and leaf switches carried the flow)
- Packet loss events and their location in the path
- Path changes (failover events, ECMP rebalancing)
- Congestion correlation across multiple flows sharing the same path segments
Together, INT and path telemetry provide a two-layer observability model. INT gives the granular hop-level detail needed for root-cause analysis. Path telemetry gives the aggregated flow-level view needed for capacity planning, SLA verification, and anomaly detection.
For AI fabric operators, this combination enables answers to questions like:
- Which leaf switch is introducing 15 microseconds of additional latency during collective all-reduce operations?
- Did a spine link failover cause path changes that affected GPU-to-GPU communication for a specific training job?
- Are microbursts on a particular egress port correlating with job completion time variance?
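A rough sketch of how hop-level records might be rolled up into a path-level view that answers the first two questions. The switch names, latency values, and record layout are hypothetical, and the latency total covers only time spent inside switches, not link propagation.

```python
from typing import Dict, List, Tuple

# Each hop record is (switch_name, hop_latency_ns); names and values are made up.
HopRecord = Tuple[str, int]


def summarize_path(hops: List[HopRecord]) -> Dict[str, object]:
    """Collapse per-hop records into a flow-level summary."""
    path = tuple(name for name, _ in hops)
    total_ns = sum(latency for _, latency in hops)
    worst_name, worst_ns = max(hops, key=lambda hop: hop[1])
    return {
        "path": path,                      # ordered switches that carried the flow
        "switch_latency_ns": total_ns,     # sum of in-switch time only
        "worst_hop": worst_name,
        "worst_hop_latency_ns": worst_ns,
    }


def path_changed(before: Dict[str, object], after: Dict[str, object]) -> bool:
    """True when two summaries of the same flow used different switches."""
    return before["path"] != after["path"]
```

Comparing summaries taken before and after a spine failover shows whether the flow moved to a different path, and the worst_hop field points at the switch to investigate first.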
Why This Matters for AI Fabric Networking
GPU training and inference workloads generate traffic patterns that are fundamentally different from traditional enterprise applications. Key characteristics include:
- Elephant flows: Large, long-lived RDMA transfers between GPUs that consume significant bandwidth on specific links.
- Incast patterns: All-reduce and all-gather collective operations that synchronize many GPUs simultaneously, creating synchronized bursts at leaf and spine switches.
- Loss sensitivity: RoCEv2 traffic requires a lossless or near-lossless fabric. Even a single packet drop can trigger timeout-based retransmissions that stall GPU operations for milliseconds.
- Latency requirements: AI collective operations are latency-bound, not just bandwidth-bound. Hop latency variations of even a few microseconds can compound across a multi-stage fabric.
Traditional monitoring approaches fail across all four dimensions. SNMP polling cannot detect microbursts. sFlow sampling misses short-lived congestion events. Streaming telemetry (gNMI/gRPC) provides better granularity but still relies on periodic counter snapshots rather than per-packet instrumentation.
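A back-of-the-envelope calculation makes the sampling and polling gap concrete. The 50-microsecond burst, 1-in-1024 sampling rate, and 60-second polling interval come from earlier in this article; the 400 Gb/s link speed and 4096-byte packet size are assumptions picked only for illustration.

```python
# Assumed values (not from the article): link speed and packet size.
LINK_RATE_BPS = 400e9          # 400 Gb/s link
PACKET_BYTES = 4096            # bytes per packet on the wire

# Values taken from the text.
BURST_SECONDS = 50e-6          # 50 microsecond microburst
SAMPLING_RATE = 1 / 1024       # sFlow 1-in-1024 sampling
POLL_INTERVAL_SECONDS = 60     # SNMP polling interval


def burst_visibility(link_rate_bps, packet_bytes, burst_seconds,
                     sampling_rate, poll_interval_seconds):
    """Estimate how visible a line-rate burst is to sampling and polling."""
    burst_bytes = link_rate_bps / 8 * burst_seconds
    burst_packets = int(burst_bytes // packet_bytes)
    return {
        "burst_packets": burst_packets,
        # Average number of sampled packets that fall inside the burst.
        "expected_samples": burst_packets * sampling_rate,
        # Chance that at least one packet of the burst is sampled.
        "p_at_least_one_sample": 1 - (1 - sampling_rate) ** burst_packets,
        # Share of one polling interval that the burst occupies.
        "poll_fraction": burst_seconds / poll_interval_seconds,
    }


result = burst_visibility(LINK_RATE_BPS, PACKET_BYTES, BURST_SECONDS,
                          SAMPLING_RATE, POLL_INTERVAL_SECONDS)
```

Under these assumptions the burst is about 610 packets, so sampling expects roughly 0.6 samples from it and misses it entirely a little more than half the time. The burst also occupies less than one millionth of a polling interval, so it disappears into the 60-second counter delta.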
INT and path telemetry address this by providing event-driven, per-packet visibility that is synchronized with the actual traffic flow. When a RoCEv2 RDMA write traverses the three switch hops of a leaf-spine-leaf path, INT metadata tells you exactly which hop introduced queuing delay and how much delay that packet experienced there.
INT on SONiC: What the Open Networking Ecosystem Provides
SONiC (Software for Open Networking in the Cloud) is an open-source network operating system that runs on switches from multiple hardware vendors and ASIC families. It is a Linux Foundation project with broad industry support and has been production-hardened in some of the largest cloud-scale data centers in the world.
For INT and path telemetry on SONiC-based fabrics, the architecture provides several relevant capabilities:
- SAI (Switch Abstraction Interface): SAI is SONiC’s hardware abstraction layer and exposes telemetry-related ASIC features across multiple chip vendors. Where the ASIC supports them, INT metadata insertion, transit, and extraction can be implemented on different switch platforms running SONiC without rewriting the control plane.
- Containerized architecture: SONiC’s Docker-container-based design isolates the telemetry collection and export functions from the core switching plane, allowing operators to upgrade telemetry agents independently without disrupting traffic forwarding.
- Programmatic configuration: SONiC supports both CLI and programmatic configuration via JSON-based config files, enabling automated telemetry policy deployment across large fabrics.
- Multi-vendor hardware support: Because SONiC decouples the NOS from the underlying switch hardware via SAI, enterprises can select ASIC platforms that best support INT metadata fields and sampling rates for their specific AI workload profiles.
For Australian enterprises, this open ecosystem approach has a practical advantage: you are not locked into a single vendor’s telemetry stack. If your AI fabric grows and you need to add capacity with a different switch platform, the telemetry framework remains consistent.
Practical Deployment Considerations for Australian AI Data Centers
Fabric Design for Telemetry Visibility
INT and path telemetry work best in fabrics with well-defined hop-by-hop paths. Leaf-spine (Clos) topologies are ideal because they provide deterministic multi-path routing and clear hop boundaries. For AI clusters, this aligns naturally with GPU backend fabric designs where predictable east-west traffic flow is essential.
When planning telemetry coverage, consider:
- Which flows to instrument: Not every packet needs INT headers. For AI fabrics, prioritize RoCEv2 traffic carrying RDMA operations, as these are the latency- and loss-sensitive flows that directly impact training job performance.
- INT header overhead: Each INT hop adds metadata to the packet header. In a 3-tier leaf-spine-super-spine fabric, this can add 20-40 bytes per hop. Plan your MTU accordingly, especially when using jumbo frames for RDMA traffic.
- Sink node placement: The telemetry sink (where INT data is extracted and exported) should be positioned to avoid creating a new bottleneck. Many deployments use dedicated packet broker infrastructure to aggregate and distribute telemetry streams without impacting production traffic.
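The MTU planning step can be worked through with simple arithmetic. In this sketch the 40-byte per-hop figure is the upper end of the range quoted above, while the 16-byte fixed header allowance, the 9216-byte fabric MTU, and the function names are assumptions that should be checked against the INT specification version and ASIC in use.

```python
def int_overhead_bytes(hops: int, per_hop_bytes: int, fixed_header_bytes: int) -> int:
    """Worst-case bytes INT adds to a packet by the time it reaches the sink."""
    return fixed_header_bytes + hops * per_hop_bytes


def max_host_mtu(fabric_mtu: int, hops: int, per_hop_bytes: int,
                 fixed_header_bytes: int) -> int:
    """Largest packet a host can send so the instrumented packet still fits."""
    return fabric_mtu - int_overhead_bytes(hops, per_hop_bytes, fixed_header_bytes)


# Longest path in a 3-tier fabric: leaf, spine, super-spine, spine, leaf.
WORST_CASE_HOPS = 5
PER_HOP_BYTES = 40           # upper end of the 20-40 byte range in the text
FIXED_HEADER_BYTES = 16      # assumed allowance for the INT shim and header
FABRIC_MTU = 9216            # assumed switch-side jumbo MTU

overhead = int_overhead_bytes(WORST_CASE_HOPS, PER_HOP_BYTES, FIXED_HEADER_BYTES)
host_mtu = max_host_mtu(FABRIC_MTU, WORST_CASE_HOPS, PER_HOP_BYTES, FIXED_HEADER_BYTES)
```

With these inputs the worst-case overhead is 216 bytes, which leaves room for a 9000-byte host packet inside a 9216-byte fabric MTU. A fabric configured with the same MTU on hosts and switches would have no headroom for INT metadata on full-size packets.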
Integrating with Existing Monitoring Stacks
INT and path telemetry do not replace your existing monitoring infrastructure — they augment it. A practical deployment typically combines:
- INT/path telemetry data for per-packet, hop-by-hop visibility into fabric congestion and latency.
- Streaming telemetry (gNMI) for periodic counter and state collection from switches.
- Packet broker infrastructure for traffic mirroring, aggregation, and delivery to security and analytics tools.
For Australian data centers operating under data sovereignty requirements, keeping telemetry collection and analysis infrastructure onshore is straightforward. The telemetry collector is typically a software component that can run on local infrastructure.
Aligning with DCBX and RoCEv2 Configuration
INT telemetry works alongside, not instead of, the DCBX and RoCEv2 configurations that make the fabric lossless. DCBX negotiation aligns priority flow control (PFC) and traffic-class settings between switches and attached NICs, while explicit congestion notification (ECN) marking thresholds are configured on the switch queues. Together these mechanisms prevent packet drops on RDMA traffic. INT then provides the visibility into whether those mechanisms are working as expected.
A practical example: DCBX configures PFC on priority 3 for RoCEv2 traffic. If INT metadata shows increasing queue occupancy at a specific leaf switch on priority 3, the operator knows that PFC backpressure is building and can investigate whether the issue is a bandwidth mismatch, an oversubscribed uplink, or a misconfigured ECN threshold.
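One way a collector might flag the situation in this example is to watch for steadily rising queue occupancy on the RoCEv2 priority. This is a minimal sketch: the sample layout, window size, threshold, and switch names are all hypothetical, and a production collector would also need timestamps and per-port keys.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (switch_name, priority, queue_occupancy_bytes) in arrival order; all hypothetical.
Sample = Tuple[str, int, int]


def rising_queues(samples: Iterable[Sample], priority: int = 3,
                  window: int = 4, threshold_bytes: int = 200_000) -> List[str]:
    """Return switches whose last `window` samples on `priority` rose steadily
    and ended at or above `threshold_bytes`."""
    history: Dict[str, List[int]] = defaultdict(list)
    for switch, prio, occupancy in samples:
        if prio == priority:
            history[switch].append(occupancy)

    flagged = []
    for switch, values in history.items():
        recent = values[-window:]
        steadily_rising = len(recent) == window and all(
            earlier < later for earlier, later in zip(recent, recent[1:])
        )
        if steadily_rising and recent[-1] >= threshold_bytes:
            flagged.append(switch)
    return sorted(flagged)
```

A switch returned by this function is a candidate for the follow-up checks named above: bandwidth mismatch, oversubscribed uplink, or ECN threshold configuration.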
Next Steps
If you are evaluating open networking for an AI data center fabric and need INT and path telemetry capabilities, the xSONIC team can help you assess which switch platforms and ASIC configurations support the telemetry features your workloads require.
- Explore xSONIC Data Center AI Switches for AI fabric deployments.
- Review the xSONIC INT Telemetry solution guide for detailed implementation guidance.
- Learn about IPTPath Telemetry for end-to-end path visibility.
- Understand how RoCE v2 and DCBX configure the lossless fabric that INT monitors.
- Contact xSONIC for a fabric assessment or telemetry architecture review for your Australian data center.