The observability blind spot inside GPU clusters
Australian enterprises deploying AI training and inference workloads are running into a problem that traditional network monitoring was never designed to solve. SNMP polling intervals measured in minutes and sampled sFlow exports cannot capture the microsecond-level latency spikes and transient congestion events that stall collective operations inside GPU clusters.
When a RoCE v2 RDMA write stalls for even a few hundred microseconds, the result is not a slow web page. It is a synchronisation barrier that idles dozens or hundreds of GPUs waiting for the lagging flow to complete. At scale, those micro-events compound into hours of wasted GPU time across a training run.
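A rough back-of-envelope calculation shows how quickly those micro-events add up. The sketch below assumes every stall idles all waiting GPUs for its full duration; the stall length, GPU count, stall frequency and step count are hypothetical figures chosen for illustration, not measurements.

```python
def wasted_gpu_hours(stall_us, gpus_waiting, stalls_per_step, steps):
    """GPU-hours lost if each stall idles every waiting GPU for its full duration."""
    idle_gpu_us = stall_us * gpus_waiting * stalls_per_step * steps
    return idle_gpu_us / 1e6 / 3600  # microseconds -> seconds -> hours


# Hypothetical run: 300 us stalls, 512 GPUs at the barrier,
# 20 stalls per training step, 1,000,000 steps.
print(round(wasted_gpu_hours(300, 512, 20, 1_000_000), 1))  # 853.3
```

In this illustrative case the idle time comes to roughly 850 GPU-hours, even though no single stall lasts longer than a third of a millisecond.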
SONiC — the open source network operating system originally hardened inside Microsoft Azure and now governed by the SONiC Foundation under the Linux Foundation — has become the NOS of choice for a growing number of AI fabric deployments. Its containerised, SAI-based architecture decouples switch software from the underlying ASIC, giving network teams a common operational model across hardware from multiple vendors. That portability is exactly what makes INT telemetry viable at scale: the data plane instrumentation travels with the packet, not with a proprietary management appliance.
What INT actually does inside a packet
In-band Network Telemetry, originally specified by the P4 Language Consortium and the Open Networking Foundation, embeds telemetry requests directly into the packet header. As the packet traverses each switch hop, the forwarding ASIC inserts metadata — typically ingress and egress timestamps, queue occupancy, ingress and egress port IDs, and optionally congestion notification flags — into a telemetry stack carried by the packet itself.
The INT sink (the egress leaf switch, the destination NIC, or a dedicated collector node) reads the accumulated telemetry stack and exports it to a time-series analytics backend. The result is a hop-by-hop latency and congestion trace for every packet of every instrumented flow, not a sampled subset, and not a five-minute average.
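As a minimal sketch of what a collector might do with one telemetry stack, the example below computes the time each packet spent inside each switch and picks out the slowest hop. The `HopMetadata` record, its field names and the sample values are assumptions made for illustration; real INT metadata layouts depend on the specification version and the ASIC.

```python
from dataclasses import dataclass


@dataclass
class HopMetadata:
    """Hypothetical per-hop record; real field layouts are ASIC-specific."""
    switch_id: int
    ingress_ts_ns: int
    egress_ts_ns: int
    queue_depth: int


def hop_latencies(stack):
    """Return (switch_id, residence time in ns) for each hop in the stack."""
    return [(h.switch_id, h.egress_ts_ns - h.ingress_ts_ns) for h in stack]


def worst_hop(stack):
    """Return the hop where the packet spent the longest time."""
    return max(stack, key=lambda h: h.egress_ts_ns - h.ingress_ts_ns)


# Illustrative values only: the middle hop shows a deep queue and a long delay.
stack = [
    HopMetadata(101, 1_000, 1_800, 12),
    HopMetadata(201, 2_500, 152_500, 940),
    HopMetadata(102, 153_200, 153_900, 8),
]
print(hop_latencies(stack))        # [(101, 800), (201, 150000), (102, 700)]
print(worst_hop(stack).switch_id)  # 201
```

Residence time is computed from two timestamps taken on the same switch, so this particular calculation does not depend on clocks being synchronised across the fabric. Comparing timestamps between different switches would.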
Path telemetry extends this model with per-flow path discovery. Rather than relying on control-plane routing state to infer where packets travelled, path telemetry instruments the data plane to report the actual forwarding path, including ECMP member selection. For AI fabrics where dozens of equal-cost paths exist between any two endpoints, knowing the exact path a flow took during a congestion event is the difference between a root-cause diagnosis and a guessing game.
Why this matters more in AI fabrics than in general-purpose DCs
Three characteristics of AI/ML cluster traffic make INT and path telemetry especially valuable:
1. Flow patterns are predictable but latency-sensitive. AI training jobs generate large, long-lived elephant flows between GPU servers during collective all-reduce operations. These flows are latency-intolerant: even small variance in tail latency translates directly into wasted compute cycles. INT gives operators per-hop, per-flow latency attribution that SNMP-based tools simply cannot provide.
2. RoCE v2 relies on the network for lossless or near-lossless delivery. RDMA over Converged Ethernet v2 depends on Priority Flow Control, DCBX negotiation, and Explicit Congestion Notification to avoid packet drops. INT surfaces where PFC pause frames are being triggered, which queues are filling, and where ECN marks are being applied — all in real time, per packet. Without INT, operators are diagnosing RoCE congestion with indirect signals like switch CPU utilisation or PFC counter polling, which is both slow and imprecise.
3. ECMP entropy and hash polarisation are silent killers. In a leaf-spine fabric with 32 or 64 ECMP paths, poor hash distribution can funnel elephant flows onto the same spine link while other links sit idle. Path telemetry reveals the actual forwarding path selection per flow, enabling operators to tune hash algorithms or adjust fabric topology before training job throughput degrades.
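The hash distribution problem in the third point can be illustrated with a toy model. The sketch below hashes each flow's 5-tuple onto one of several ECMP members and reports how uneven the resulting link loads are. CRC32 is only a stand-in here, since real ASIC hash functions are vendor-specific and configurable; the flow list is invented, with UDP destination port 4791 used because that is the RoCE v2 port.

```python
import zlib


def ecmp_member(five_tuple, n_links):
    """Pick an ECMP member by hashing the 5-tuple (CRC32 as a stand-in hash)."""
    key = "|".join(str(field) for field in five_tuple).encode()
    return zlib.crc32(key) % n_links


def link_loads(flows, n_links):
    """Sum the bytes that land on each ECMP member."""
    loads = [0] * n_links
    for five_tuple, nbytes in flows:
        loads[ecmp_member(five_tuple, n_links)] += nbytes
    return loads


def imbalance(loads):
    """Ratio of the busiest link to the mean; 1.0 means a perfectly even spread."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean else 0.0


# Hypothetical elephant flows: (src IP, dst IP, src port, dst port, protocol), bytes.
flows = [
    (("10.0.0.%d" % i, "10.0.1.%d" % i, 49152 + i, 4791, 17), 1_000_000)
    for i in range(16)
]
loads = link_loads(flows, n_links=4)
print(loads, round(imbalance(loads), 2))
```

With only a handful of large flows, even a well-behaved hash can leave one link carrying far more than its share. The model can only predict the spread; path telemetry reports where the flows actually went.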
The Australian market context
Australia’s AI infrastructure buildout is accelerating. Hyperscale cloud providers (AWS in Sydney and Melbourne, Azure in Sydney, Melbourne and Canberra, Google Cloud in Sydney and Melbourne) have expanded local GPU capacity. Domestic colocation providers and managed service providers are adding GPU-as-a-service offerings. Government agencies working to the Australian Signals Directorate’s Essential Eight framework, and organisations preparing for the outcomes of the Privacy Act review, increasingly require network-level auditability.
For Australian buyers evaluating a new AI fabric or upgrading an existing spine-leaf deployment, the observability question is no longer optional. The question is: do you buy a proprietary telemetry stack from your switch vendor, or do you deploy an open, ASIC-agnostic telemetry plane on top of an open NOS?
This is where the SONiC and INT combination becomes a strategic decision, not just a technical one.
How INT works on SONiC-based fabrics
SONiC’s INT implementation uses the SAI (Switch Abstraction Interface) telemetry APIs. On supported ASICs, including Broadcom and Marvell Teralynx families and NVIDIA Spectrum devices that also support Pure SONiC, the INT source and sink functions are configured through SONiC’s configuration framework. The switch inserts INT metadata at the fabric ingress (source), accumulates it at each hop (transit), and exports it at the egress leaf or at a dedicated monitoring endpoint (sink).
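The three roles can be modelled in a few lines. This is a conceptual sketch only: the `process` function, the dictionary-based packet and the switch names are hypothetical, and none of it represents SONiC configuration or the SAI API.

```python
def process(packet, switch_id, role, queue_depth):
    """Apply one switch's INT role to a packet; return a report at the sink."""
    record = {"switch_id": switch_id, "queue_depth": queue_depth}
    report = None
    if role == "source":
        # The source starts the telemetry stack with its own metadata.
        packet["int_stack"] = [record]
    elif "int_stack" in packet:
        # Transit and sink hops append only to packets that are already instrumented.
        packet["int_stack"].append(record)
        if role == "sink":
            # The sink strips the stack so the end host receives a clean packet.
            report = packet.pop("int_stack")
    return report


packet = {"payload": b"rdma-write"}
report = None
for switch_id, role, depth in [
    ("leaf1", "source", 3),
    ("spine2", "transit", 870),
    ("leaf4", "sink", 5),
]:
    report = process(packet, switch_id, role, depth) or report

print([hop["switch_id"] for hop in report])  # ['leaf1', 'spine2', 'leaf4']
print("int_stack" in packet)                 # False
```

The sink removes the stack before delivery, which is why the receiving host never has to understand INT.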
The telemetry data is typically exported via IPFIX or gRPC streaming to an analytics platform. Several open source and commercial collectors can ingest INT data, and the SONiC community has been actively contributing INT-related improvements to the SAI specification.
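Whatever the transport, the collector usually has to flatten each report into rows that a time-series backend can index. The sketch below assumes a simple one-row-per-hop schema; the field names and sample values are illustrative, and the actual IPFIX templates or gRPC message definitions will differ by platform.

```python
import json


def to_rows(flow_id, report_ts_ns, stack):
    """Flatten one INT report into one row per hop (hypothetical schema)."""
    return [
        {"flow_id": flow_id, "report_ts_ns": report_ts_ns, "hop_index": i, **hop}
        for i, hop in enumerate(stack)
    ]


rows = to_rows(
    "flowA",
    1_700_000_000_000_000_000,
    [
        {"switch_id": "leaf1", "queue_depth": 3},
        {"switch_id": "spine2", "queue_depth": 870},
    ],
)
for row in rows:
    print(json.dumps(row, sort_keys=True))
```

Keeping the hop index in each row lets the backend rebuild the hop order for any flow with a simple sort.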
For path telemetry, the implementation is similar but focuses on path identification rather than latency measurement. Each switch appends its switch ID and egress port to a path stack in the packet header, giving the collector a complete forwarding path record for each flow.
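Once the collector holds a path record per flow, finding flows that collided on the same link is a grouping exercise. The sketch below assumes each path record is a list of (switch ID, egress port) pairs; the flow names, switch names and port numbers are made up for illustration.

```python
from collections import defaultdict


def flows_per_link(path_records):
    """Map each (switch_id, egress_port) to the set of flows that used it."""
    usage = defaultdict(set)
    for flow_id, path in path_records.items():
        for hop in path:
            usage[hop].add(flow_id)
    return usage


def shared_links(path_records, min_flows=2):
    """Return the links carrying at least min_flows flows."""
    return {
        link: sorted(flows)
        for link, flows in flows_per_link(path_records).items()
        if len(flows) >= min_flows
    }


# Hypothetical path records as a collector might reconstruct them.
records = {
    "flowA": [("leaf1", 49), ("spine1", 3)],
    "flowB": [("leaf2", 49), ("spine1", 3)],
    "flowC": [("leaf1", 50), ("spine2", 3)],
}
print(shared_links(records))  # {('spine1', 3): ['flowA', 'flowB']}
```

Here two flows share egress port 3 on spine1 while spine2 carries only one. This is the kind of overlap that can show up as a stalled collective when the flows are elephants.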
xSONiC data center AI switches ship with Enterprise SONiC that supports INT and path telemetry on qualified hardware. The combination of open NOS, open telemetry instrumentation, and multi-vendor ASIC support means Australian operators are not locked into a single vendor’s management plane to get hop-by-hop visibility.
The packet broker angle
Not every monitoring use case requires INT at the switch level. For passive traffic analysis, security inspection, and compliance logging, network packet brokers aggregate, filter, and replicate traffic to monitoring tools. In AI data center environments, packet brokers sit at the fabric edge to capture inter-tenant traffic, export copies to DLP and IDS appliances, and feed traffic metadata to SIEM platforms.
The relationship between INT and packet brokering is complementary: INT provides the active, per-packet telemetry from the forwarding plane, while packet brokers provide the passive traffic mirroring and tool delivery layer. Australian operators building out AI infrastructure should consider both layers as part of their observability and security architecture.
Buyer checklist: INT readiness for AI fabric evaluation
When evaluating switch hardware and NOS platforms for an AI fabric deployment in Australia, the following INT and path telemetry capabilities should be on the buyer checklist:
| Capability | Why it matters | What to verify |
|---|---|---|
| INT source/transit/sink support | Enables hop-by-hop latency and congestion attribution | Confirm SAI INT API support on target ASIC |
| Per-flow path telemetry | Reveals actual ECMP forwarding path per flow | Confirm path ID insertion at each hop |
| PFC and ECN event correlation | Links INT telemetry to RoCE congestion signals | Verify DCBX and PFC counter export alongside INT |
| Telemetry export format | Determines compatibility with analytics backend | IPFIX, gRPC, or streaming telemetry support |
| Multi-vendor ASIC portability | Prevents lock-in to a single silicon vendor | Verify SAI compliance across target switch models |
| Packet broker integration | Complements INT with passive traffic capture | Confirm mirror/session support and broker interoperability |
In short, SONiC is a production-hardened open source NOS with growing INT support, AI fabric traffic patterns demand per-flow visibility that legacy monitoring cannot deliver, and Australian operators have both technical and strategic reasons to evaluate open telemetry stacks on open NOS platforms.
Next steps for Australian buyers
Organisations planning or upgrading AI data center fabrics in Australia should:
- Map their current observability stack against INT and path telemetry capabilities.
- Evaluate whether their switch ASIC and NOS combination supports SAI INT APIs.
- Assess the analytics backend required to ingest and visualise INT data at scale.
- Consider the total cost of telemetry — including the cost of not having per-flow visibility when a training job stalls.
xSONiC publishes solution guides for INT technology, IPTPath telemetry, AI fabric design, and RoCE v2 optimisation. These resources are available at the xSONiC solutions pages and are a starting point for technical evaluation.