Introduction: The Visibility Imperative for AI Data Centers
Australian enterprises and hyperscale operators are investing heavily in AI infrastructure. Behind every training run, every inference query, and every distributed model update sits an Ethernet fabric - a high-performance network of switches, routers, and links that must deliver low latency and cannot tolerate silent failures.
Network visibility - the ability to observe, capture, and analyse traffic flowing through every link, port, and device in the fabric - is the difference between proactive performance management and costly, reactive troubleshooting. Yet as data center architectures evolve to support AI workloads, visibility strategies designed for traditional workloads often leave critical blind spots.
This article explores the foundational concepts that make visibility essential, examines where gaps emerge in AI-optimised Ethernet fabrics, and positions xSONiC’s role in helping Australian organisations maintain operational control.
Network Fundamentals: The Building Blocks of a Data Center Fabric
A computer network is a collection of interconnected devices that communicate with each other to exchange data and resources. In the data center context, these devices include servers, switches, routers, firewalls, and storage systems connected through wired media such as optical fibre and copper cabling.
Key network devices in a data center fabric:
- Switches connect devices within the same network and manage internal data communication. They send data only to the intended device, improving network efficiency and performance. Within a data center, switches form the backbone of the Ethernet fabric.
- Routers connect multiple networks and direct data between them, using IP addresses to determine the best path for each packet.
- Firewalls monitor and control network traffic, blocking unauthorised access and protecting networks from cyber threats.
Modern data center networks commonly follow a spine-and-leaf topology, which industry certification frameworks list as a key design pattern alongside mesh, star, and three-tier topologies. Because every leaf switch connects to every spine switch, any two servers on different leaves are the same number of hops apart. This gives predictable latency and high east-west traffic capacity - both critical for AI workloads that involve massive parallel data movement between GPU clusters.
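The properties of a spine-and-leaf design can be illustrated with some simple arithmetic. The sketch below is a minimal model, assuming a two-tier fabric in which every leaf has exactly one uplink to every spine; the class name, field names, and port counts are hypothetical and are not drawn from any particular vendor design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LeafSpineFabric:
    """Hypothetical two-tier fabric: each leaf has one uplink to each spine."""
    spines: int
    leaves: int
    uplink_gbps: float           # speed of each leaf-to-spine link
    host_ports_per_leaf: int     # server-facing ports on each leaf
    host_port_gbps: float        # speed of each server-facing port

    def equal_cost_paths(self) -> int:
        # Between two different leaves there is one path through each spine.
        return self.spines

    def switches_traversed(self) -> int:
        # Source leaf -> spine -> destination leaf, for any pair of leaves.
        return 3

    def oversubscription_ratio(self) -> float:
        # Server-facing capacity divided by spine-facing capacity, per leaf.
        downlink = self.host_ports_per_leaf * self.host_port_gbps
        uplink = self.spines * self.uplink_gbps
        return downlink / uplink


fabric = LeafSpineFabric(spines=4, leaves=16, uplink_gbps=800,
                         host_ports_per_leaf=32, host_port_gbps=100)
print(fabric.equal_cost_paths(), fabric.oversubscription_ratio())
```

A ratio of 1.0 means the leaf can forward all server traffic to the spines without contention. Ratios above 1.0 indicate oversubscription, which is one place where congestion, and therefore the need for visibility, tends to appear.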
Devices communicate using protocols - agreed-upon rules for how data is formatted, transmitted, received, and acknowledged. The TCP/IP protocol suite is the foundation of all modern networking, defining addressing, identification, and routing specifications. Each device is identified by a unique IP address, while network interface controllers (NICs) carry unique MAC addresses for local identification.
Data is broken into small packets for transmission across the network. Each packet consists of control information (source and destination addresses, error detection codes) and user data (payload). Packet-switched networks allow bandwidth to be shared efficiently among multiple users and flows - a fundamental property that enables the multiplexing that AI traffic demands.
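To make the split between control information and payload concrete, the sketch below parses an Ethernet II frame that carries an IPv4 packet, using only the Python standard library. The function name and the sample frame are illustrative assumptions. The sample's header checksum is left at zero rather than computed, and the Ethernet frame check sequence is omitted because capture tools usually do not see it.

```python
import ipaddress
import struct


def parse_frame(frame: bytes) -> dict:
    """Split an Ethernet II frame into control fields and payload."""
    # Ethernet II header: destination MAC, source MAC, EtherType (14 bytes).
    dst_mac, src_mac, ethertype = struct.unpack("!6s6sH", frame[:14])
    fields = {
        "dst_mac": dst_mac.hex(":"),
        "src_mac": src_mac.hex(":"),
        "ethertype": ethertype,
    }
    if ethertype != 0x0800:          # not IPv4: return the rest untouched
        fields["payload"] = frame[14:]
        return fields

    ip = frame[14:]
    header_len = (ip[0] & 0x0F) * 4  # IHL field counts 32-bit words
    (total_len,) = struct.unpack("!H", ip[2:4])
    (checksum,) = struct.unpack("!H", ip[10:12])
    fields.update({
        "protocol": ip[9],
        "header_checksum": checksum,  # error detection code for the IP header
        "src_ip": str(ipaddress.IPv4Address(ip[12:16])),
        "dst_ip": str(ipaddress.IPv4Address(ip[16:20])),
        "payload": ip[header_len:total_len],
    })
    return fields


# Build a hypothetical sample frame (checksum deliberately left as 0).
payload = b"gradient-chunk"
ip_header = struct.pack(
    "!BBHHHBBH4s4s", 0x45, 0, 20 + len(payload), 0, 0, 64, 17, 0,
    ipaddress.IPv4Address("10.0.0.1").packed,
    ipaddress.IPv4Address("10.0.0.2").packed,
)
eth_header = bytes.fromhex("aabbccddeeff") + bytes.fromhex("112233445566") + b"\x08\x00"
sample = eth_header + ip_header + payload
print(parse_frame(sample))
```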
What Is Network Visibility and Why Does It Matter?
Network visibility refers to the ability to see and understand all traffic traversing a network. In practical terms, this means having access to:
- Traffic flow data - understanding who is talking to whom, at what volume, and how often
- Packet-level detail - the ability to capture and inspect individual packets for deep analysis
- Device-level telemetry - real-time health, performance, and configuration data from switches, routers, and other infrastructure
- Baseline metrics - historical performance data that enables detection of anomalies and degradation
Industry-recognised monitoring approaches include:
- SNMP (Simple Network Management Protocol) - polling device status and counters
- Flow data collection - aggregating traffic summaries (e.g., NetFlow, sFlow, IPFIX)
- Packet capture - recording raw packet data for forensic and troubleshooting analysis
- Port mirroring - copying traffic from one or more switch ports to a monitoring interface
- Log aggregation - centralising device logs for correlation and analysis
- API integration - programmatically collecting telemetry from network infrastructure
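As an illustration of what counter polling yields in practice, the sketch below turns two readings of an interface octet counter into a link utilisation figure. It assumes the readings come from a 64-bit counter such as IF-MIB's ifHCInOctets and that the counter wraps at most once between polls. How the values are retrieved is deliberately left out, and the function name and sample numbers are hypothetical.

```python
COUNTER64_MODULUS = 2 ** 64  # 64-bit counters wrap back to zero at this value


def link_utilisation(prev_octets: int, curr_octets: int, interval_s: float,
                     link_bps: float, modulus: int = COUNTER64_MODULUS) -> float:
    """Return utilisation (0.0 to 1.0) between two counter polls."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    # Modular subtraction gives the right delta even if the counter wrapped once.
    delta_octets = (curr_octets - prev_octets) % modulus
    bits_per_second = delta_octets * 8 / interval_s
    return bits_per_second / link_bps


# Two polls taken 30 seconds apart on a 100 Gbit/s link.
util = link_utilisation(1_000_000_000, 31_000_000_000, 30, 100e9)
print(f"{util:.1%}")
```

Note that a 30-second polling interval averages away short bursts, which is one reason counter polling alone is rarely sufficient for AI fabrics.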
Without comprehensive visibility, network teams operate blind. They cannot identify bottlenecks, diagnose intermittent failures, detect security threats, or optimise traffic flows. In an AI data center, where a single training job may generate terabits per second of east-west traffic across hundreds of GPU nodes, the consequences of limited visibility include:
- Undetected packet loss causing job failures or degraded model accuracy
- Inability to pinpoint congestion points in the fabric
- Slower mean time to repair (MTTR) during outages
- Incomplete security posture across the east-west traffic corridor
The AI Data Center: A Different Kind of Network Challenge
AI data centers place fundamentally different demands on the network compared to traditional enterprise or cloud workloads. Understanding these differences is essential for designing an effective visibility strategy.
East-west traffic dominance: In traditional data centers, traffic patterns are largely north-south (client to server). In AI data centers, the dominant traffic is east-west - server-to-server communication between GPU nodes during distributed training, parameter synchronisation, and collective operations. This means internal fabric links carry the majority of load, and visibility must extend deep into the fabric, not just at the perimeter.
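One way to quantify this split from flow records is to classify each flow by whether both endpoints sit inside the fabric. The sketch below is a minimal example, assuming flow records are available as (source, destination, bytes) tuples and that the fabric uses the 10.0.0.0/8 range. Both the record shape and the prefix are assumptions for illustration only.

```python
import ipaddress

# Assumption: all fabric-internal hosts are addressed from this prefix.
FABRIC_PREFIXES = [ipaddress.ip_network("10.0.0.0/8")]


def is_internal(address: str) -> bool:
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in FABRIC_PREFIXES)


def classify_flow(src: str, dst: str) -> str:
    """Label a flow east-west if both ends are inside the fabric."""
    if is_internal(src) and is_internal(dst):
        return "east-west"
    return "north-south"


def east_west_share(flows) -> float:
    """Fraction of total bytes carried by east-west flows."""
    total = east_west = 0
    for src, dst, byte_count in flows:
        total += byte_count
        if classify_flow(src, dst) == "east-west":
            east_west += byte_count
    return east_west / total if total else 0.0


flows = [
    ("10.1.0.5", "10.1.0.9", 9_000_000),     # GPU node to GPU node
    ("10.1.0.5", "203.0.113.7", 1_000_000),  # GPU node to external client
]
print(east_west_share(flows))
```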
Extreme throughput requirements: Ethernet has scaled dramatically since its origins. What began as a 2.94 Mbit/s protocol at Xerox PARC in 1973 has evolved through 10 Mbit/s, 100 Mbit/s, 1 Gbit/s, and beyond - with speeds up to 800 Gbit/s standardised by IEEE as of 2024. AI data centers operate at these leading-edge speeds, meaning monitoring infrastructure must be capable of handling high-bandwidth traffic without introducing latency or data loss.
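A quick calculation shows what these speeds mean for monitoring equipment. The sketch below estimates the maximum frame rate on a link, assuming standard Ethernet framing in which each frame is accompanied by 20 bytes of on-wire overhead (7-byte preamble, 1-byte start-of-frame delimiter, and a 12-byte inter-packet gap). The function name is hypothetical.

```python
# On-wire overhead per frame: 7 preamble + 1 start delimiter + 12 inter-packet gap.
ETHERNET_OVERHEAD_BYTES = 20


def max_frames_per_second(link_bps: float, frame_bytes: int) -> float:
    """Upper bound on frames per second for a given frame size.

    frame_bytes is the full frame including header and frame check
    sequence (64 bytes is the Ethernet minimum).
    """
    bits_on_wire = (frame_bytes + ETHERNET_OVERHEAD_BYTES) * 8
    return link_bps / bits_on_wire


for size in (64, 1518):
    rate = max_frames_per_second(800e9, size)
    print(f"{size}-byte frames at 800 Gbit/s: {rate:,.0f} frames/s")
```

At minimum frame size an 800 Gbit/s link can carry roughly 1.19 billion frames per second, which leaves a capture device well under a nanosecond per frame. This is why full packet capture at these speeds typically depends on filtering, sampling, or hardware assistance.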
Sensitivity to packet loss and latency: AI training jobs are synchronised across many nodes, so the slowest flow sets the pace for the whole job. Even small amounts of packet loss or jitter can cause timeouts, retransmissions, and degraded training throughput. Monitoring must therefore detect these micro-events in near real time.
Convergence of compute and network: In AI clusters, the network is not a passive connector - it is an active participant in the computation pipeline. Network performance directly impacts GPU utilisation, training time, and model quality. Visibility into network behaviour is therefore visibility into compute performance.
Common troubleshooting challenges in high-performance fabrics:
- Switching issues (spanning tree, VLAN assignment, ACL misconfigurations)
- Routing issues (routing table errors, incorrect default routes)
- Performance degradation (congestion, latency, packet loss)
- Cabling and physical interface issues (signal degradation, transceiver mismatch)
For Australian organisations, these challenges are compounded by the need to maintain data sovereignty, comply with the Privacy Act 1988 (Cth), and support remote or distributed data center sites across geographically dispersed locations.
Visibility Architecture: Where to Instrument an AI Ethernet Fabric
Effective network visibility in an AI data center requires instrumentation at multiple points across the fabric. Based on established network architecture principles, the following deployment points are recommended:
1. Spine Layer TAPs and Aggregation: The spine layer interconnects all leaf switches and carries the bulk of east-west traffic. Deploying network TAPs (Test Access Points) or leveraging switch-level port mirroring at the spine layer provides a centralised view of inter-pod traffic flows.
2. Leaf Layer Access: Leaf switches connect directly to server NICs and GPU nodes. Monitoring at the leaf layer enables per-host traffic analysis, which is critical for identifying which specific GPU node or application is generating anomalous traffic.
3. Management and Storage Networks: AI data centers often have separate management and storage networks (e.g., NVMe-over-Fabrics or iSCSI). Visibility into these networks ensures that storage I/O bottlenecks are not mistaken for network fabric issues.
4. Network Services Layer: DNS, DHCP, NTP, and load balancing services underpin fabric operation. Monitoring these services ensures that foundational network functions are not the source of application-level failures.
5. Security Boundary Monitoring: While east-west traffic is the primary focus, perimeter visibility remains essential. Firewalls, intrusion detection/prevention systems (IDS/IPS), and access control lists (ACLs) must be monitored for both performance and security events.
Aggregation and Analytics: Raw telemetry from all instrumentation points should flow to a centralised analytics platform where:
- Flow data is correlated with packet captures for root cause analysis
- Baseline metrics are compared against real-time observations
- Alerts are triggered when thresholds are breached
- Historical data supports capacity planning and trend analysis
This layered approach ensures that no traffic segment goes unmonitored, while avoiding the prohibitive cost and complexity of capturing every packet at every point at all times.
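The comparison of real-time observations against a baseline can be sketched with a simple statistical threshold. The example below flags an observation that sits more than a chosen number of standard deviations from the historical mean. The function name, the threshold of 3.0, and the sample latency values are assumptions for illustration, and a production platform would likely use more robust methods.

```python
from statistics import mean, stdev


def breaches_baseline(baseline: list[float], observed: float,
                      z_threshold: float = 3.0) -> bool:
    """Return True if observed deviates from the baseline beyond the threshold.

    baseline must contain at least two historical samples.
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any change at all is a deviation.
        return observed != mu
    return abs(observed - mu) / sigma > z_threshold


# Hypothetical leaf-to-leaf latency samples in microseconds.
baseline_latency_us = [10.0, 10.2, 9.8, 10.1, 9.9]
print(breaches_baseline(baseline_latency_us, 10.1))  # within normal variation
print(breaches_baseline(baseline_latency_us, 14.0))  # well outside the baseline
```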
Troubleshooting AI Fabric Issues: A Methodology
When performance degrades in an AI data center Ethernet fabric, structured troubleshooting is essential. The industry-recognised methodology follows these steps:
1. Identify the problem - Gather information from users, monitoring systems, and device logs to characterise the issue. Is it latency? Packet loss? A specific flow or application?
2. Establish a theory - Based on the symptoms, hypothesise potential root causes. Consider the OSI model layers: physical (cabling, optics), data link (switching, VLANs), network (routing, IP addressing), and transport (TCP behaviour).
3. Test the theory - Use packet captures, flow analysis, device counters, and cable/physical testing to validate or eliminate hypotheses.
4. Plan and implement a solution - Once the root cause is confirmed, plan the remediation. This may involve configuration changes, hardware replacement, or traffic engineering.
5. Verify functionality - Confirm that the fix resolves the issue without introducing new problems.
6. Document findings - Record the incident, root cause, and resolution for future reference and knowledge sharing.
Common AI fabric issues that benefit from visibility tooling:
- Microbursts causing momentary congestion and packet drops
- Asymmetric routing causing flow imbalance across spine links
- MTU mismatches (jumbo frame configuration errors) causing large AI data transfers to be fragmented or dropped
- Spanning tree or fabric protocol misconfigurations creating suboptimal paths
- Transceiver degradation in high-speed optical links
- Access control list (ACL) misconfigurations blocking legitimate inter-node traffic
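Microbursts are a good example of why sampling granularity matters. The sketch below scans per-interval byte counts for intervals where the link ran close to line rate, even though average utilisation over the whole window looks healthy. It assumes byte counts are available at millisecond granularity from streaming telemetry or a capture appliance. The function names, the 90% threshold, and the sample data are hypothetical.

```python
def interval_capacity_bytes(link_bps: float, interval_s: float) -> float:
    """Bytes a link can carry in one sampling interval."""
    return link_bps * interval_s / 8


def find_microbursts(bytes_per_interval: list[int], interval_s: float,
                     link_bps: float, burst_threshold: float = 0.9) -> list[int]:
    """Return indexes of intervals whose utilisation meets the threshold."""
    capacity = interval_capacity_bytes(link_bps, interval_s)
    return [i for i, count in enumerate(bytes_per_interval)
            if count / capacity >= burst_threshold]


def average_utilisation(bytes_per_interval: list[int], interval_s: float,
                        link_bps: float) -> float:
    """Utilisation averaged over the whole sampling window."""
    capacity = interval_capacity_bytes(link_bps, interval_s)
    return sum(bytes_per_interval) / (capacity * len(bytes_per_interval))


# Five 1 ms samples on a 100 Gbit/s link (12.5 MB of capacity per interval).
samples = [1_000_000, 1_200_000, 12_400_000, 900_000, 1_100_000]
print(find_microbursts(samples, 1e-3, 100e9))
print(f"{average_utilisation(samples, 1e-3, 100e9):.1%}")
```

In this sample the third interval runs at about 99% of line rate while the five-interval average stays below 30%, so a monitor that only reports averages would miss the burst entirely.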