
Why INT and Path Telemetry Matter for SONiC-Based AI Data Center Observability

In-band Network Telemetry (INT) and IPTPath telemetry are emerging as critical observability layers for AI data center fabrics running SONiC. This analysis brief examines why traditional SNMP and polling-based monitoring

By xSONiC Team · Tags: SONiC, open networking, data center, AI fabric, Ethernet, automation

The Observability Gap in AI Data Center Fabrics

AI training and inference clusters impose traffic patterns that traditional data center monitoring tools were not designed to handle. GPU-to-GPU communication during distributed training generates sustained, high-bandwidth flows that can saturate links for hours. Microbursts, RDMA congestion notifications, and packet drops that would be invisible in a standard enterprise network can stall a training job worth thousands of GPU-hours.

Most legacy monitoring stacks rely on SNMP polling intervals of 30 to 300 seconds, or sampled NetFlow data that misses short-lived congestion events. For RoCE v2 traffic carrying RDMA workloads, even a single dropped packet can trigger a timeout and retransmission, degrading throughput across an entire collective operation. Because polled counters are averaged over the interval and flow data is sampled, the network team has no visibility into the exact hop where the problem occurred.
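A quick calculation shows why interval-averaged counters hide short congestion events. The sketch below uses hypothetical numbers (a 2 ms line-rate burst on a link otherwise running at 40 percent, polled every 30 seconds); the function name and all values are illustrative assumptions, not measurements.

```python
def average_utilization(burst_seconds: float, burst_utilization: float,
                        idle_utilization: float, poll_interval_seconds: float) -> float:
    """Mean link utilization a poller would report over one polling interval."""
    if not 0 <= burst_seconds <= poll_interval_seconds:
        raise ValueError("burst must fit inside the polling interval")
    idle_seconds = poll_interval_seconds - burst_seconds
    weighted = burst_seconds * burst_utilization + idle_seconds * idle_utilization
    return weighted / poll_interval_seconds


# Hypothetical scenario: 2 ms at 100% utilization, the rest of a 30 s interval at 40%.
reported = average_utilization(0.002, 1.0, 0.40, 30.0)
print(f"Reported average utilization: {reported:.4%}")
```

Under these assumptions the reported average is almost indistinguishable from the 40 percent baseline, even though the queue may have overflowed during the burst.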

SONiC, the open-source network operating system hosted under the Linux Foundation, supports a full suite of data center protocols, including BGP and RDMA, and uses a containerized architecture that runs on switches from multiple vendors and ASICs (sonicfoundation.dev). However, the observability features available in a SONiC deployment depend heavily on the silicon capability of the underlying switch hardware and the telemetry features enabled in the SONiC software stack.

What Is In-Band Network Telemetry (INT)?

In-band Network Telemetry (INT) is a data plane telemetry framework originally defined in the P4/INT specification by the P4 Language Consortium and the ONF. Unlike traditional out-of-band monitoring that polls switch counters at intervals, INT embeds metadata directly into data packets as they traverse the network. Each switch along the path appends information such as hop latency, queue depth, ingress/egress port identifiers, and link utilization into an INT header or metadata stack carried by the packet itself.
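The per-hop mechanism described above can be modelled conceptually as a metadata stack that travels with the packet. This is a minimal sketch for illustration only: the class and field names are assumptions, and it does not reproduce the on-wire header layout, ordering, or encoding defined by the INT specification.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class HopMetadata:
    """One hop's worth of telemetry (hypothetical field names)."""
    switch_id: int
    ingress_port: int
    egress_port: int
    hop_latency_ns: int
    queue_depth: int


@dataclass
class Packet:
    payload: bytes
    int_stack: List[HopMetadata] = field(default_factory=list)


def transit(packet: Packet, hop: HopMetadata) -> Packet:
    """A transit switch adds its own metadata to the packet."""
    packet.int_stack.append(hop)
    return packet


def sink(packet: Packet) -> List[HopMetadata]:
    """The sink strips the metadata so the end host sees the original packet."""
    report = list(packet.int_stack)
    packet.int_stack.clear()
    return report


packet = Packet(payload=b"rdma-write")
transit(packet, HopMetadata(101, 1, 49, 800, 12))
transit(packet, HopMetadata(201, 3, 7, 41000, 950))
report = sink(packet)
print([hop.switch_id for hop in report])
```

In this model the sink would forward `report` to a telemetry collector, while the payload continues to its destination without the extra metadata.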

The result is per-packet, per-hop visibility. When an AI training flow experiences increased latency or packet loss, an INT-enabled fabric can pinpoint exactly which switch, which port, and which queue caused the degradation — without relying on sampled data or guessing from aggregated counters.

INT operates at the ASIC level, meaning the switch silicon must support INT header insertion, transit, and sink operations. Major silicon vendors have incorporated INT support into their data center switch ASICs. The SONiC community has been working on INT integration as part of its broader telemetry architecture, though the maturity and completeness of INT support vary across SONiC distributions and hardware platforms.

For Australian operators evaluating open networking for AI fabrics, INT support is a hardware and software capability question, not just a NOS feature checkbox. The switch ASIC, the SONiC distribution, and the telemetry collector must all align for end-to-end INT to function.

IPTPath Telemetry: Complementing INT with Path-Level Visibility

While INT provides per-packet, per-hop metadata, IPTPath telemetry extends the visibility model to include path-level analysis. IPTPath telemetry aggregates per-hop INT data into a complete path trace, showing the exact route a flow took through the fabric and the performance characteristics at each point along that path.
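As a rough illustration of that aggregation step, the sketch below folds a list of per-hop records into a single path trace and identifies the hop contributing the most latency. The record layout, switch names, and values are hypothetical and do not represent any particular collector's schema.

```python
from typing import Dict, List, NamedTuple


class Hop(NamedTuple):
    switch: str
    egress_port: str
    latency_us: float
    queue_depth: int


def build_path_trace(hops: List[Hop]) -> Dict[str, object]:
    """Summarize per-hop records into one path-level view."""
    if not hops:
        raise ValueError("a path trace needs at least one hop")
    worst = max(hops, key=lambda hop: hop.latency_us)
    return {
        "path": [hop.switch for hop in hops],
        "total_latency_us": sum(hop.latency_us for hop in hops),
        "worst_hop": worst,
    }


# Hypothetical three-hop leaf-spine-leaf path.
hops = [
    Hop("leaf1", "Ethernet8", 1.5, 12),
    Hop("spine2", "Ethernet40", 42.0, 900),
    Hop("leaf7", "Ethernet16", 2.5, 20),
]
trace = build_path_trace(hops)
print(trace["path"], trace["total_latency_us"], trace["worst_hop"].switch)
```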

For AI fabric operators, this matters because:

  • ECMP path variance: In leaf-spine fabrics with equal-cost multi-path routing, different flows within the same training job may traverse different paths. IPTPath telemetry reveals whether some paths are consistently more heavily loaded or show higher latency than others.

  • Congestion root-cause analysis: When multiple GPUs report timeouts simultaneously, path telemetry can distinguish between a single hot-spot switch and a distributed congestion event across many hops.

  • RoCE v2 congestion correlation: By combining INT hop data with path-level flow traces, operators can correlate RDMA congestion notification packets (CNPs) with the specific network hop that triggered them. This is critical for tuning DCBX, PFC, and ECN parameters in RoCE v2 fabrics.
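The ECMP path variance point above can be sketched as a simple grouping exercise: collect path traces per distinct path, average the latency, and flag paths that are much slower than the best one. The trace format, the switch names, and the 2x threshold are assumptions chosen for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple

Path = Tuple[str, ...]


def mean_latency_by_path(traces: Iterable[Tuple[Path, float]]) -> Dict[Path, float]:
    """Average end-to-end latency (microseconds) for each distinct path."""
    grouped: Dict[Path, List[float]] = defaultdict(list)
    for path, latency_us in traces:
        grouped[path].append(latency_us)
    return {path: mean(values) for path, values in grouped.items()}


def outlier_paths(per_path: Dict[Path, float], factor: float = 2.0) -> List[Path]:
    """Paths whose mean latency exceeds `factor` times the fastest path."""
    if not per_path:
        return []
    baseline = min(per_path.values())
    return [path for path, value in per_path.items() if value > factor * baseline]


# Hypothetical traces from one training job spread over two spines.
traces = [
    (("leaf1", "spine1", "leaf7"), 6.0),
    (("leaf1", "spine1", "leaf7"), 8.0),
    (("leaf1", "spine2", "leaf7"), 40.0),
    (("leaf1", "spine2", "leaf7"), 44.0),
]
per_path = mean_latency_by_path(traces)
slow = outlier_paths(per_path)
print(slow)
```

A real deployment would likely use percentiles rather than means, since tail latency is what stalls collective operations.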

xSONiC positions both INT Telemetry and IPTPath Telemetry as named solution pillars for data center deployments. The combination addresses a gap that proprietary networking vendors often fill with closed-source fabric controllers and analytics platforms, locking customers into vertically integrated stacks.

The SONiC Observability Stack: What Exists Today

The SONiC community and the broader open networking ecosystem offer several observability building blocks, but they are not all integrated into a single turnkey platform:

SONiC Streaming Telemetry (gNMI/gRPC): SONiC supports streaming telemetry via gNMI subscriptions, allowing external collectors to receive real-time counter data from switches at configurable intervals. This is a significant improvement over SNMP polling but still operates at the counter level, not the per-packet level.
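To make "counter level" concrete, the sketch below turns two successive octet-counter samples, as a collector might receive them from a streaming subscription, into an average link utilization. It assumes a 64-bit wrapping counter and hypothetical sample values; it does not call any gNMI library.

```python
def link_utilization(prev_octets: int, curr_octets: int,
                     interval_seconds: float, link_bps: float,
                     counter_bits: int = 64) -> float:
    """Average utilization (0.0 to 1.0) between two octet-counter samples."""
    if interval_seconds <= 0 or link_bps <= 0:
        raise ValueError("interval and link speed must be positive")
    # The modulo handles a counter that wrapped between the two samples.
    delta_octets = (curr_octets - prev_octets) % (1 << counter_bits)
    return (delta_octets * 8) / (interval_seconds * link_bps)


# Hypothetical samples taken 10 s apart on a 400 Gb/s port.
utilization = link_utilization(0, 250_000_000_000, 10.0, 400e9)
print(f"{utilization:.0%}")
```

The result is still an average over the sample interval, which is why it cannot show what happened to an individual packet at an individual hop.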

SAI (Switch Abstraction Interface) Telemetry Extensions: The SAI API, which abstracts ASIC-specific operations for SONiC, includes telemetry object models for INT. However, the actual implementation depends on the ASIC vendor’s SAI driver. Not all switch platforms running SONiC have production-ready INT SAI implementations.

NVIDIA NetQ: NVIDIA’s NetQ platform provides real-time visibility, troubleshooting, and lifecycle management for data center networks and is available for SONiC-based deployments alongside Cumulus Linux (nvidia.com). NetQ offers validated telemetry workflows but is tied to NVIDIA’s networking ecosystem.

Open-Source Collectors: Tools like Telegraf, Prometheus, and OpenTelemetry can ingest SONiC streaming telemetry data. However, parsing INT metadata and building path-level views from INT headers typically requires additional tooling or custom development.

For Australian enterprises, the practical reality is that deploying INT and path telemetry on a SONiC fabric requires assembling multiple components: INT-capable switch hardware, a SONiC distribution with INT support, a telemetry collector that understands INT headers, and a visualization or alerting layer. This is achievable with open networking, but it is not as turnkey as a proprietary fabric controller offering.

Why Australian AI Data Center Operators Should Care

Australia’s data center market is expanding rapidly, driven by hyperscale cloud builds, sovereign AI requirements, and enterprise adoption of GPU-accelerated workloads. Several factors make INT and path telemetry particularly relevant for Australian operators:

Geographic latency: Australian AI clusters often connect across multiple availability zones or metro sites. Understanding per-hop latency at the network level helps operators distinguish between fabric congestion and geographic propagation delay, especially for distributed training jobs that span racks or sites.
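A back-of-the-envelope split of measured latency into propagation and fabric components might look like the following. It relies on the common rule of thumb that light travels through optical fibre at roughly 5 microseconds per kilometre; the distance, measured latency, and per-hop values are hypothetical.

```python
from typing import Dict, Iterable

# Rule-of-thumb one-way delay in optical fibre (approximate).
FIBER_DELAY_US_PER_KM = 5.0


def split_one_way_latency(measured_us: float, fiber_km: float,
                          hop_latencies_us: Iterable[float]) -> Dict[str, float]:
    """Separate propagation delay from per-hop switch latency."""
    propagation_us = fiber_km * FIBER_DELAY_US_PER_KM
    fabric_us = sum(hop_latencies_us)
    return {
        "propagation_us": propagation_us,
        "fabric_us": fabric_us,
        "unaccounted_us": measured_us - propagation_us - fabric_us,
    }


# Hypothetical metro link: 40 km of fibre, 230 us measured one way,
# and three hops reporting 4, 6, and 15 us of switch latency.
breakdown = split_one_way_latency(230.0, 40.0, [4.0, 6.0, 15.0])
print(breakdown)
```

In this example most of the delay is distance, which no amount of fabric tuning will remove; per-hop data is what makes that distinction visible.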

Sovereign data requirements: Australian organizations subject to data sovereignty regulations need observability into their own fabric rather than relying on cloud-provider telemetry that may not expose network-level detail. INT on a locally operated SONiC fabric gives operators the visibility they own and control.

Vendor diversification: Australian enterprises have historically been heavily dependent on a small number of incumbent networking vendors. Open networking with SONiC, combined with INT-based observability, offers a path to hardware diversification without sacrificing the granular visibility that AI workloads demand.

Cost of blind spots: In AI infrastructure, network blind spots are not just an operational inconvenience. A single undetected congestion point can waste hours of GPU compute time across a training cluster. The economic argument for per-packet telemetry is strongest when the cost of the workloads traversing the network is highest — and GPU-hour costs in Australian data centers are significant.

Australian operators evaluating xSONiC data center switches for AI fabric deployments should ask vendors specifically about INT and path telemetry support at the ASIC level, the SONiC version and distribution, and the integration path to their existing monitoring stack.

What to Watch: The INT Telemetry Roadmap for SONiC

The INT telemetry story in SONiC is evolving. Several developments are worth monitoring:

  1. SONiC community INT progress: The SONiC community has been iterating on telemetry support through SAI and SONiC observability workstreams. Operators should track release notes, SAI capability matrices, and hardware validation notes for the specific switch ASICs under consideration.

  2. ASIC vendor INT maturity: Not all switch ASICs implement INT equally. Operators should verify production-grade INT or path telemetry support in the SAI driver, confirm the supported metadata fields, and validate export behaviour in a lab before relying on telemetry for operational alerting.

  3. Open-source INT collectors and visualization: The P4/INT ecosystem has reference collectors, but production-grade path telemetry visualization for SONiC is still a developing area. Projects and vendor tools that bridge this gap will be critical for adoption.

  4. Convergence with eBPF-based observability: The Linux kernel’s eBPF subsystem is increasingly used for network observability on hosts. The intersection of host-side eBPF telemetry and switch-side INT metadata could provide end-to-end visibility from the application to the ASIC. This is an emerging area, not yet standard in SONiC deployments.

  5. NVIDIA Spectrum platform integration: NVIDIA’s Spectrum Ethernet switches support SONiC and offer NetQ for network observability (nvidia.com). Whether and how INT data from Spectrum ASICs integrates with NetQ or open-source collectors is a relevant question for operators considering the NVIDIA+SONiC path.
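The lab validation suggested in item 2 above could start with a simple gap check between the metadata fields an operator needs and the fields a platform actually exports. The field names below are hypothetical placeholders, not identifiers from SAI or any vendor's documentation.

```python
from typing import Iterable, List


def missing_metadata_fields(required: Iterable[str],
                            supported: Iterable[str]) -> List[str]:
    """Fields the operator needs that the platform did not export in testing."""
    return sorted(set(required) - set(supported))


# Hypothetical requirement list and lab observation.
required = ["switch_id", "ingress_port", "egress_port", "hop_latency", "queue_depth"]
observed_in_lab = ["switch_id", "ingress_port", "egress_port", "hop_latency"]

missing = missing_metadata_fields(required, observed_in_lab)
print(missing)
```

An empty result is only a starting point; export behaviour under load still needs to be tested separately.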

xSONiC will continue tracking INT and path telemetry maturity across the SONiC ecosystem as community releases, silicon support, and production tooling evolve. For buyers planning AI fabric, GPU backend, RoCE v2, DCBX, or Fast CNP deployments, telemetry readiness should be evaluated alongside port speed and switching capacity.

Sources Reviewed