
Why AI Data Centers Need In-Band Network Telemetry -- And Where Open SONiC Fits

INT and path telemetry are becoming essential for AI fabric observability. This analysis examines where SONiC stands on INT support and what it means for Australian data center operators building GPU clusters.

By xSONiC Team · SONiC, open networking, data center, AI fabric, Ethernet, automation

The AI Fabric Observability Gap

As Australian enterprises and colocation providers scale GPU clusters for private LLM inference and model training, a recurring pain point has emerged: traditional SNMP and polling-based monitoring cannot keep pace with the microsecond-level congestion events that degrade RoCE/RDMA traffic in AI back-end fabrics. When a single elephant flow stalls on a spine link, GPU collective operations such as all-reduce can cascade into multi-second job slowdowns that operators struggle to diagnose with conventional tools.
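To see why interval-based polling misses these events, a quick back-of-the-envelope calculation helps. The sketch below is illustrative arithmetic only: the link speed, polling interval, burst duration, and baseline utilisation are all assumed values, not measurements from any fabric.

```python
# Illustrative arithmetic only: every number below is an assumption.
LINK_GBPS = 400            # assumed link speed
POLL_INTERVAL_S = 30.0     # assumed SNMP polling interval
BURST_DURATION_S = 200e-6  # assumed 200 microsecond burst at full line rate
BASELINE_UTIL = 0.40       # assumed steady utilisation outside the burst


def average_utilisation(poll_s: float, burst_s: float, baseline: float) -> float:
    """Utilisation a byte counter reports when averaged over one poll interval."""
    # The link is 100% busy during the burst and at baseline the rest of the time.
    busy_time = burst_s * 1.0 + (poll_s - burst_s) * baseline
    return busy_time / poll_s


avg = average_utilisation(POLL_INTERVAL_S, BURST_DURATION_S, BASELINE_UTIL)
burst_bytes = LINK_GBPS * 1e9 / 8 * BURST_DURATION_S

print(f"reported average utilisation: {avg:.6f}")
print(f"bytes arriving during burst:  {burst_bytes:,.0f}")
```

Under these assumed numbers the polled average barely moves from the 40% baseline, even though roughly 10 MB arrives at line rate during the burst, which is more than enough to fill shallow switch buffers and trigger PFC.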

In-band Network Telemetry (INT) and path telemetry address this gap by embedding metadata (per-hop latency, queue depth, egress port utilisation, and buffer occupancy) directly into data plane packet headers as packets traverse each switch. The result is hop-by-hop, per-flow visibility that does not depend on out-of-band polling intervals.

This analysis examines why INT matters for AI data center fabrics specifically, where the SONiC open networking ecosystem currently stands on INT support, and what that means for buyers evaluating open networking in Australian data center markets.

What INT and Path Telemetry Actually Do

In-band Network Telemetry, originally specified by the P4.org Applications Working Group and paralleled at the IETF by the In-situ OAM (IOAM) work, allows a source (often a SmartNIC or DPU, though a first-hop switch can also play this role) to instruct each switch in the forwarding path to append telemetry metadata to a packet. As the packet traverses leaf and spine switches, each hop stamps information such as switch ID, ingress/egress port, queue depth, and transit latency. A sink, typically the last-hop switch or the destination endpoint, then extracts this data and forwards it to a collector.
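The source, transit, and sink roles can be sketched as a toy model. This is a minimal sketch for intuition, not the INT wire format: the class names, field names, units, and switch names are all hypothetical, and real implementations encode this metadata in binary headers handled by the ASIC.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HopMetadata:
    """Per-hop record; field names and units are illustrative assumptions."""
    switch_id: str
    ingress_port: int
    egress_port: int
    queue_depth: int
    transit_latency_ns: int


@dataclass
class Packet:
    flow_id: str
    payload: bytes
    int_enabled: bool = False
    int_stack: List[HopMetadata] = field(default_factory=list)


def int_source(pkt: Packet) -> Packet:
    """Source role: mark the packet so downstream hops add metadata."""
    pkt.int_enabled = True
    return pkt


def int_transit(pkt: Packet, hop: HopMetadata) -> Packet:
    """Transit role: append this hop's metadata if the packet is marked."""
    if pkt.int_enabled:
        pkt.int_stack.append(hop)
    return pkt


def int_sink(pkt: Packet) -> List[HopMetadata]:
    """Sink role: strip the metadata and return it as a report."""
    report = list(pkt.int_stack)
    pkt.int_stack.clear()
    pkt.int_enabled = False
    return report


# Usage: one packet crossing a hypothetical leaf-spine-leaf path.
pkt = int_source(Packet(flow_id="flow-a", payload=b"rdma-write"))
int_transit(pkt, HopMetadata("leaf1", 1, 49, 120, 900))
int_transit(pkt, HopMetadata("spine1", 7, 3, 4100, 52000))
int_transit(pkt, HopMetadata("leaf2", 50, 12, 80, 700))
report = int_sink(pkt)

for hop in report:
    print(hop.switch_id, hop.queue_depth, hop.transit_latency_ns)
```

The sink returns the packet to its original form before delivery, so the receiving application never sees the telemetry; only the collector does.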

Path telemetry extends the concept with variations suited to different silicon capabilities. Some implementations use mirror-on-drop or per-flow state export rather than in-packet header insertion, which can be more practical on ASICs with limited programmability. The key outcome for AI fabric operators is the same: a near-real-time view of where congestion is occurring and which flows are affected.

For GPU training clusters running RoCE v2 with priority flow control (PFC) negotiated via DCBX, INT telemetry can pinpoint exactly which leaf-to-spine link is absorbing congestion, whether a misconfigured PFC threshold is causing head-of-line blocking, or whether an optical transceiver link is experiencing intermittent errors. Without INT, operators typically rely on aggregate counters and end-to-end job completion times, which tell them that something is wrong but not where.
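As a rough illustration of how per-hop reports turn "something is wrong" into "this link is wrong", the sketch below groups hop records by egress link and ranks the links by mean queue depth. The record layout, the sample values, and the ranking metric are assumptions made for this example; a production collector would work from whatever report schema the deployed INT implementation actually emits.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

# Assumed record layout:
# (flow_id, switch_id, egress_port, queue_depth, transit_latency_ns)
HopRecord = Tuple[str, str, int, int, int]


def worst_links(records: List[HopRecord], top_n: int = 1) -> List[Dict]:
    """Rank (switch, egress port) pairs by mean reported queue depth."""
    by_link = defaultdict(list)
    for flow_id, switch_id, egress_port, queue_depth, latency_ns in records:
        by_link[(switch_id, egress_port)].append((flow_id, queue_depth, latency_ns))

    summary = []
    for link, samples in by_link.items():
        summary.append({
            "link": link,
            "mean_queue_depth": mean(s[1] for s in samples),
            "max_latency_ns": max(s[2] for s in samples),
            "flows": sorted({s[0] for s in samples}),
        })
    # Deepest average queue first.
    summary.sort(key=lambda row: row["mean_queue_depth"], reverse=True)
    return summary[:top_n]


# Hypothetical sample data: two flows sharing one spine egress port.
records: List[HopRecord] = [
    ("flow-a", "leaf1", 49, 120, 900),
    ("flow-a", "spine1", 3, 4100, 52000),
    ("flow-a", "leaf2", 12, 80, 700),
    ("flow-b", "leaf3", 49, 95, 800),
    ("flow-b", "spine1", 3, 3900, 48000),
    ("flow-b", "leaf2", 14, 60, 650),
]

worst = worst_links(records)[0]
print(worst["link"], worst["mean_queue_depth"], worst["flows"])
```

In this made-up data set both flows converge on the same spine egress port, so the ranking identifies the congested link and the affected flows in a single pass, which aggregate counters alone cannot do.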

SONiC’s Position in the INT Landscape

SONiC (Software for Open Networking in the Cloud) is an open-source network operating system under the Linux Foundation that runs on switches from multiple vendors and ASICs. According to the SONiC Foundation, the platform offers a full suite of network functionality including BGP and RDMA, production-hardened in the data centers of some of the largest cloud service providers. Its container-based architecture decouples hardware from software and enables modular feature development.

NVIDIA’s networking portfolio — including Spectrum Ethernet switches — explicitly supports Pure SONiC alongside Cumulus Linux as a network operating system choice. NVIDIA Spectrum-X, the company’s Ethernet platform designed for AI workloads, promotes features such as zero-touch accelerated RoCE and actionable visibility. NVIDIA NetQ is positioned as the operations tool for real-time visibility and troubleshooting.

The SONiC community has been progressively adding INT support through the INT sink/source/transit framework, with ongoing development in the sonic-swss and sonic-int teams. However, the maturity of INT features varies across ASIC backends and SONiC distribution versions. Not all switches running SONiC have identical INT capabilities, and some features may require specific ASIC support (such as programmable pipeline access on certain Broadcom or Marvell silicon).

For buyers evaluating open networking as an alternative to proprietary AI fabric stacks, this variance is a critical point. INT on SONiC is real and advancing, but the question of which specific switch hardware, which SONiC version, and which ASIC supports the INT features needed for production AI fabric observability requires careful validation.

Why This Matters for Australian AI Data Center Builds

Australia’s data center market is expanding rapidly, with major facilities in Sydney, Melbourne, Brisbane, and Perth adding capacity for GPU-dense deployments. Hyperscale colocation providers and enterprise buyers are deploying 400G and 800G spine-leaf fabrics to support AI training and inference workloads.

In this context, observability is not a nice-to-have — it is a capacity planning and SLA enforcement requirement. When a GPU cluster costs millions of dollars in hardware alone, and training jobs can run for days, the ability to detect and resolve fabric congestion in real time directly impacts ROI.

Open networking with SONiC-based switches offers Australian buyers an alternative to closed vendor stacks that bundle telemetry as a premium licensed feature. The value proposition is straightforward: deploy open-source SONiC on multi-vendor switching hardware, use INT or path telemetry for fabric-wide visibility, and avoid per-switch telemetry license costs that scale linearly with fabric size.

However, this value proposition only holds if the INT implementation is production-ready on the chosen hardware. Buyers should evaluate:

  • ASIC-level INT support (source, transit, and sink roles)
  • Collector and analytics tooling integration
  • Compatibility with RoCE v2 and DCBX configurations
  • Maturity of the SONiC version on the target switch platform
  • Path telemetry alternatives where in-packet INT is not supported

These are solvable problems, but they are not automatic. The open networking ecosystem requires more upfront engineering validation than a turnkey proprietary solution.
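One way to make that validation repeatable is to record the checklist as data and check each candidate platform against it. The sketch below is one hypothetical way to do that: the field names, the platform names, and the pass/fail rules are assumptions for illustration, not a description of any vendor's qualification process.

```python
from dataclasses import dataclass
from typing import FrozenSet, List

REQUIRED_INT_ROLES = frozenset({"source", "transit", "sink"})


@dataclass(frozen=True)
class PlatformProfile:
    """Lab-validation results for one switch platform (all fields hypothetical)."""
    name: str
    int_roles: FrozenSet[str]        # roles confirmed working on this ASIC
    sonic_version_qualified: bool    # target SONiC version tested on this hardware
    collector_validated: bool        # telemetry reaches the chosen collector
    roce_pfc_tested: bool            # INT verified alongside RoCE v2 and PFC
    path_telemetry_fallback: bool    # alternative available if in-packet INT is not


def validation_gaps(profile: PlatformProfile) -> List[str]:
    """Return the open checklist items for a platform; empty means none found."""
    gaps = []
    missing = REQUIRED_INT_ROLES - profile.int_roles
    if missing and not profile.path_telemetry_fallback:
        gaps.append(f"missing INT roles with no fallback: {sorted(missing)}")
    if not profile.sonic_version_qualified:
        gaps.append("SONiC version not qualified on this platform")
    if not profile.collector_validated:
        gaps.append("collector integration not validated")
    if not profile.roce_pfc_tested:
        gaps.append("not tested with RoCE v2 and PFC")
    return gaps


# Usage with two made-up platforms.
ready = PlatformProfile("example-leaf-a", REQUIRED_INT_ROLES, True, True, True, False)
partial = PlatformProfile("example-spine-b", frozenset({"transit"}), True, False, True, False)

for platform in (ready, partial):
    print(platform.name, validation_gaps(platform) or "no gaps found")
```

Keeping the results in a structure like this makes it straightforward to rerun the same checks when a new SONiC release or a different switch model enters the evaluation.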

Proprietary Telemetry Lock-In: A Buyer Concern Worth Naming

One of the less-discussed aspects of AI fabric observability is how deeply telemetry features are tied to proprietary switch software and ASIC ecosystems. When a vendor’s INT implementation only works with that vendor’s NOS, collector, and management platform, the buyer is locked into a closed stack for the lifecycle of the fabric.

For a 128-node GPU cluster with 512 switch ports, this lock-in can represent significant ongoing licensing and support costs. More importantly, it limits the buyer’s ability to mix switch vendors across tiers — a common strategy in large fabrics where different leaf and spine use cases may favor different hardware price-performance points.

SONiC’s multi-vendor architecture is designed to break this pattern. A SONiC-based fabric where INT telemetry is handled through open interfaces — P4-based or API-driven — allows operators to swap switch hardware without losing observability. This is the long-term value proposition, even if the ecosystem is still maturing for some INT features.

For Australian buyers specifically, where the pool of network engineering talent familiar with open networking is growing but still smaller than in US or European markets, the skills investment in SONiC and INT is a factor. The SONiC community, operating under the Linux Foundation, provides documentation, wiki resources, and an active GitHub presence, which supports the skills development pipeline.

What xSONIC Buyers Should Evaluate

For organizations evaluating xSONIC data center AI switches and packet broker solutions for AI fabric deployments, INT and path telemetry readiness should be on the evaluation checklist. Specific questions to raise during the buying process:

  1. Which SONiC version ships on the target switch platform, and what INT source/transit/sink roles does it support?
  2. Does the switch ASIC support in-packet INT header insertion, or is an alternative path telemetry mechanism available?
  3. What collector and analytics integrations are validated — is there a supported path to Grafana, ELK, or a purpose-built INT analytics platform?
  4. How does INT interact with RoCE v2, DCBX, and PFC configurations in the target deployment?
  5. What is the roadmap for INT feature parity across xSONIC’s bare-metal switch and data center AI switch product lines?

These are not questions that a datasheet will answer. They require engagement with xSONIC’s technical team and ideally a proof-of-concept or lab validation before committing to a production fabric. The goal is to ensure that the open networking observability promise is matched by production-ready capability on the specific hardware and software combination being deployed.

The Open Networking Trajectory

The direction of travel is clear. INT and path telemetry are becoming standard expectations for data center fabrics that carry AI workloads. The SONiC ecosystem, hosted as a Linux Foundation project with contributions from major cloud providers and silicon vendors, is progressively closing the feature gap with proprietary NOS platforms.

NVIDIA’s commitment to supporting Pure SONiC on Spectrum switches, combined with the company’s own Spectrum-X platform positioning for AI networking, validates the open networking path even from a vendor with a strong proprietary portfolio. The existence of both Cumulus Linux and Pure SONiC as NOS options on the same hardware gives buyers a migration path from proprietary to open as SONiC features mature.

For the Australian market, where data center capacity for AI workloads is being built out now, the timing of INT readiness on SONiC is relevant. Buyers who lock into proprietary telemetry stacks today may face migration costs in three to five years as the open networking ecosystem matures and talent availability shifts. Buyers who invest in SONiC-based fabrics with INT today accept more upfront integration work but position themselves for long-term flexibility.

Neither path is wrong. But the decision should be made with eyes open about where the ecosystem is heading and what the tradeoffs are.
