Why INT and Path Telemetry Are Becoming Non-Negotiable for AI Data Center Observability in Australia

The observability blind spot inside GPU clusters

Australian enterprises deploying AI training and inference workloads are running into a problem that traditional network monitoring was never designed to solve. SNMP polling intervals measured in minutes and sampled sFlow exports cannot capture the microsecond-level latency spikes and transient congestion events that stall collective operations inside GPU clusters.

When a RoCE v2 RDMA write stalls for even a few hundred microseconds, the result is not a slow web page. It is a synchronisation barrier that idles dozens or hundreds of GPUs waiting for the lagging flow to complete. At scale, those micro-events compound into hours of wasted GPU time across a training run.
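The compounding effect can be made concrete with a back-of-the-envelope calculation. The sketch below multiplies stall duration by the number of idle GPUs and the number of stalled synchronisation steps. The function name and all input figures are hypothetical placeholders chosen for illustration, not measurements from any deployment.

```python
def wasted_gpu_hours(stall_us: float, gpus_waiting: int, stalled_steps: int) -> float:
    """GPU-hours lost when `gpus_waiting` GPUs idle for `stall_us`
    microseconds on each of `stalled_steps` synchronisation steps."""
    idle_gpu_seconds = (stall_us / 1_000_000) * gpus_waiting * stalled_steps
    return idle_gpu_seconds / 3600


# Hypothetical inputs: a 300 microsecond stall, 256 GPUs waiting at the
# barrier, and one million affected steps over a training run.
lost = wasted_gpu_hours(stall_us=300, gpus_waiting=256, stalled_steps=1_000_000)
print(f"{lost:.1f} GPU-hours idle")  # about 21.3 GPU-hours
```

The point of the arithmetic is that the per-event cost scales with cluster size, so the same stall becomes more expensive as the job spans more GPUs.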

SONiC — the open source network operating system originally hardened inside Microsoft Azure and now governed by the SONiC Foundation under the Linux Foundation — has become the NOS of choice for a growing number of AI fabric deployments. Its containerised, SAI-based architecture decouples switch software from the underlying ASIC, giving network teams a common operational model across hardware from multiple vendors. That portability is exactly what makes INT telemetry viable at scale: the data plane instrumentation travels with the packet, not with a proprietary management appliance.

What INT actually does inside a packet

In-band Network Telemetry, originally specified by the P4 Language Consortium and the Open Networking Foundation, embeds telemetry requests directly into the packet header. As the packet traverses each switch hop, the forwarding ASIC inserts metadata — typically ingress and egress timestamps, queue occupancy, ingress and egress port IDs, and optionally congestion notification flags — into a telemetry stack carried by the packet itself.

The destination NIC or a dedicated collector node reads the accumulated telemetry stack and exports it to a time-series analytics backend. The result is a hop-by-hop latency and congestion trace for every instrumented flow, rather than a sampled subset or a five-minute average.
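As a rough illustration of what a collector might do with the accumulated stack, the sketch below computes per-hop residence time and picks out the slowest hop. The `HopMetadata` record, its field names, and the sample values are assumptions made for this example; real INT metadata layouts depend on the specification version and the ASIC.

```python
from dataclasses import dataclass


@dataclass
class HopMetadata:
    """One entry in the telemetry stack (hypothetical layout)."""
    switch_id: int
    ingress_ts_ns: int
    egress_ts_ns: int
    queue_depth: int  # unit is ASIC-specific (cells or bytes)


def hop_latencies(stack):
    """Return (switch_id, residence time in ns) for each hop, in path order."""
    return [(h.switch_id, h.egress_ts_ns - h.ingress_ts_ns) for h in stack]


def worst_hop(stack):
    """Return the (switch_id, residence time) pair with the largest delay."""
    return max(hop_latencies(stack), key=lambda pair: pair[1])


# Illustrative stack for one packet crossing leaf, spine, leaf.
stack = [
    HopMetadata(switch_id=101, ingress_ts_ns=1_000, egress_ts_ns=1_800, queue_depth=12),
    HopMetadata(switch_id=201, ingress_ts_ns=2_500, egress_ts_ns=9_500, queue_depth=340),
    HopMetadata(switch_id=102, ingress_ts_ns=10_200, egress_ts_ns=10_900, queue_depth=8),
]
print(worst_hop(stack))  # (201, 7000)
```

Residence time is computed from two timestamps taken on the same switch, so this calculation does not depend on clock synchronisation between switches. Estimating link latency between hops would.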

Path telemetry extends this model with per-flow path discovery. Rather than relying on control-plane routing state to infer where packets travelled, path telemetry instruments the data plane to report the actual forwarding path, including ECMP member selection. For AI fabrics where dozens of equal-cost paths exist between any two endpoints, knowing the exact path a flow took during a congestion event is the difference between a root-cause diagnosis and a guessing game.

Why this matters more in AI fabrics than in general-purpose DCs

Three characteristics of AI/ML cluster traffic make INT and path telemetry especially valuable:

1. Flow patterns are predictable but latency-sensitive. AI training jobs generate large, long-lived elephant flows between GPU servers during collective all-reduce operations. These flows are latency-intolerant: even small variance in tail latency translates directly into wasted compute cycles. INT gives operators per-hop, per-flow latency attribution that SNMP-based tools simply cannot provide.

2. RoCE v2 relies on the network for lossless or near-lossless delivery. RDMA over Converged Ethernet v2 depends on Priority Flow Control, DCBX negotiation, and Explicit Congestion Notification to avoid packet drops. INT surfaces where PFC pause frames are being triggered, which queues are filling, and where ECN marks are being applied — all in real time, per packet. Without INT, operators are diagnosing RoCE congestion with indirect signals like switch CPU utilisation or PFC counter polling, which is both slow and imprecise.

3. ECMP entropy and hash polarisation are silent killers. In a leaf-spine fabric with 32 or 64 ECMP paths, poor hash distribution can funnel elephant flows onto the same spine link while other links sit idle. Path telemetry reveals the actual forwarding path selection per flow, enabling operators to tune hash algorithms or adjust fabric topology before training job throughput degrades.
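The third point lends itself to a simple check once per-flow path data is available. The sketch below counts flows per spine uplink and reports how far the busiest link sits above the mean. The function names, link names, and flow identifiers are hypothetical, and counting flows is a simplification: a production check would more likely weight each flow by bytes.

```python
from collections import Counter


def link_load(flow_to_link):
    """Count flows per uplink. `flow_to_link` maps flow id -> chosen uplink."""
    return Counter(flow_to_link.values())


def imbalance_ratio(flow_to_link, all_links):
    """Busiest link's flow count divided by the mean across all links.
    1.0 means a perfectly even spread; higher values suggest polarisation."""
    load = link_load(flow_to_link)
    counts = [load.get(link, 0) for link in all_links]
    mean = sum(counts) / len(counts)
    return max(counts) / mean if mean else 0.0


# Illustrative data: 8 flows across 4 uplinks, 6 of them on the same link.
links = ["spine1", "spine2", "spine3", "spine4"]
flows = {f"flow{i}": "spine1" for i in range(6)}
flows.update({"flow6": "spine2", "flow7": "spine3"})

print(imbalance_ratio(flows, links))  # 3.0
```

Including links with zero flows in the mean matters here: an idle link is exactly the symptom being looked for.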

The Australian market context

Australia’s AI infrastructure buildout is accelerating. Hyperscale cloud providers (AWS in Sydney, Azure in Melbourne and Canberra, Google Cloud in Sydney) have expanded local GPU capacity. Domestic colocation providers and managed service providers are adding GPU-as-a-service offerings. Government agencies subject to the Australian Signals Directorate’s Essential Eight framework, and to changes arising from the Privacy Act review, are increasingly requiring network-level auditability.

For Australian buyers evaluating a new AI fabric or upgrading an existing spine-leaf deployment, the observability question is no longer optional. The question is: do you buy a proprietary telemetry stack from your switch vendor, or do you deploy an open, ASIC-agnostic telemetry plane on top of an open NOS?

This is where the SONiC and INT combination becomes a strategic decision, not just a technical one.

How INT works on SONiC-based fabrics

SONiC’s INT implementation leverages the SAI (Switch Abstraction Interface) telemetry APIs. On supported ASICs — including Broadcom Memory-class and Marvell Teralynx families, and NVIDIA Spectrum devices that also support Pure SONiC — the INT sink and source functions are configured through SONiC’s configuration framework. The switch inserts INT metadata at the fabric ingress (source), accumulates it at each hop (transit), and exports it at the egress leaf or at a dedicated monitoring endpoint (sink).
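The three roles can be pictured with a toy model. The sketch below treats a packet as a Python dictionary and is purely conceptual: it is not SONiC or SAI code, and the function names, the `int_stack` key, and the metadata fields are all assumptions made for illustration. In this simplified version the source only attaches an empty stack, and every switch on the path then records its own entry as a transit hop.

```python
def int_source(packet: dict) -> dict:
    """Fabric ingress: mark the packet as instrumented with an empty stack."""
    return {**packet, "int_stack": []}


def int_transit(packet: dict, switch_id: int, queue_depth: int) -> dict:
    """Each hop: append this switch's metadata if the packet is instrumented."""
    if "int_stack" not in packet:
        return packet  # uninstrumented traffic passes through unchanged
    entry = {"switch_id": switch_id, "queue_depth": queue_depth}
    return {**packet, "int_stack": packet["int_stack"] + [entry]}


def int_sink(packet: dict):
    """Egress: strip the stack and return (clean packet, telemetry report)."""
    clean = {k: v for k, v in packet.items() if k != "int_stack"}
    report = {"flow": packet.get("flow"), "hops": packet.get("int_stack", [])}
    return clean, report


original = {"flow": "gpu-a->gpu-b", "payload": b"rdma-write"}
pkt = int_source(original)
pkt = int_transit(pkt, switch_id=1, queue_depth=10)
pkt = int_transit(pkt, switch_id=2, queue_depth=300)
delivered, report = int_sink(pkt)
```

After the sink runs, `delivered` matches the original packet and `report` carries the per-hop entries, which mirrors the idea that the receiving host never sees the telemetry stack.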

The telemetry data is typically exported via IPFIX or gRPC streaming to an analytics platform. Several open source and commercial collectors can ingest INT data, and the SONiC community has been actively contributing INT-related improvements to the SAI specification.

For path telemetry, the implementation is similar but focuses on path identification rather than latency measurement. Each switch appends its switch ID and egress port to a path stack in the packet header, giving the collector a complete forwarding path record for each flow.
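On the collector side, a path stack of this kind can be turned into a readable path string and used to group flows that share a path. The sketch below assumes each record arrives as a flow identifier plus an ordered list of (switch ID, egress port) pairs; the record shape, the helper names, and the sample switch names are hypothetical.

```python
from collections import defaultdict


def format_path(path_stack):
    """Render [(switch_id, egress_port), ...] as 'leaf1:49 -> spine2:3 -> ...'."""
    return " -> ".join(f"{switch}:{port}" for switch, port in path_stack)


def flows_by_path(records):
    """Group flow ids by the exact forwarding path they reported."""
    groups = defaultdict(list)
    for flow_id, path_stack in records:
        groups[format_path(path_stack)].append(flow_id)
    return dict(groups)


# Illustrative records: two flows share a path, one takes a different spine.
records = [
    ("flow-a", [("leaf1", 49), ("spine2", 3), ("leaf4", 12)]),
    ("flow-b", [("leaf1", 50), ("spine3", 3), ("leaf4", 12)]),
    ("flow-c", [("leaf1", 49), ("spine2", 3), ("leaf4", 12)]),
]

# Paths carrying more than one flow are candidates for contention.
shared = {path: ids for path, ids in flows_by_path(records).items() if len(ids) > 1}
print(shared)  # {'leaf1:49 -> spine2:3 -> leaf4:12': ['flow-a', 'flow-c']}
```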

xSONiC data center AI switches ship with Enterprise SONiC that supports INT and path telemetry on qualified hardware. The combination of open NOS, open telemetry instrumentation, and multi-vendor ASIC support means Australian operators are not locked into a single vendor’s management plane to get hop-by-hop visibility.

The packet broker angle

Not every monitoring use case requires INT at the switch level. For passive traffic analysis, security inspection, and compliance logging, network packet brokers aggregate, filter, and replicate traffic to monitoring tools. In AI data center environments, packet brokers sit at the fabric edge to capture inter-tenant traffic, export copies to DLP and IDS appliances, and feed traffic metadata to SIEM platforms.

The relationship between INT and packet brokering is complementary: INT provides the active, per-packet telemetry from the forwarding plane, while packet brokers provide the passive traffic mirroring and tool delivery layer. Australian operators building out AI infrastructure should consider both layers as part of their observability and security architecture.

Buyer checklist: INT readiness for AI fabric evaluation

When evaluating switch hardware and NOS platforms for an AI fabric deployment in Australia, the following INT and path telemetry capabilities should be on the buyer checklist:

Capability	Why it matters	What to verify
INT source/transit/sink support	Enables hop-by-hop latency and congestion attribution	Confirm SAI INT API support on target ASIC
Per-flow path telemetry	Reveals actual ECMP forwarding path per flow	Confirm path ID insertion at each hop
PFC and ECN event correlation	Links INT telemetry to RoCE congestion signals	Verify DCBX and PFC counter export alongside INT
Telemetry export format	Determines compatibility with analytics backend	IPFIX, gRPC, or streaming telemetry support
Multi-vendor ASIC portability	Prevents lock-in to a single silicon vendor	Verify SAI compliance across target switch models
Packet broker integration	Complements INT with passive traffic capture	Confirm mirror/session support and broker interoperability

In short, the argument is threefold: SONiC is a production-hardened open source NOS with growing INT support; AI fabric traffic patterns demand per-flow visibility that legacy monitoring cannot deliver; and Australian operators have both a technical and a strategic reason to evaluate open telemetry stacks on open NOS platforms.

Next steps for Australian buyers

Organisations planning or upgrading AI data center fabrics in Australia should:

1. Map their current observability stack against INT and path telemetry capabilities.
2. Evaluate whether their switch ASIC and NOS combination supports SAI INT APIs.
3. Assess the analytics backend required to ingest and visualise INT data at scale.
4. Consider the total cost of telemetry, including the cost of not having per-flow visibility when a training job stalls.

xSONiC publishes solution guides for INT technology, IPTPath telemetry, AI fabric design, and RoCE v2 optimisation. These resources are available at the xSONiC solutions pages and are a starting point for technical evaluation.

Sources Reviewed

SONiC Foundation: https://sonicfoundation.dev/
SONiC GitHub: https://github.com/sonic-net/SONiC
Azure SONiC Documentation: https://azure.github.io/SONiC
Open Compute Networking: https://www.opencompute.org/projects/networking
Broadcom Ethernet Switching: https://www.broadcom.com/products/ethernet-connectivity/switching
Marvell Switching: https://www.marvell.com/products/switching.html
NVIDIA Ethernet Switching: https://www.nvidia.com/en-us/networking/ethernet-switching
