Network and Storage Planning for Private AI Inference

Why Private AI Inference Changes Your Network and Storage Plan

When enterprises move AI inference from public cloud endpoints to private infrastructure, the network and storage conversation shifts dramatically. Public cloud AI services abstract away the fabric. Private deployments do not.

A private AI inference server — whether it runs a large language model (LLM), a retrieval-augmented generation (RAG) pipeline, or a multimodal vision-language model — demands predictable, low-latency connectivity between GPUs, between GPUs and storage, and between the inference server and the application layer. Unlike traditional application servers that tolerate millisecond-level jitter, GPU inference workloads can stall or produce inconsistent token throughput when the network introduces congestion or packet loss during RDMA transfers.

The SONiC Foundation describes SONiC (Software for Open Networking in the Cloud) as an open-source network operating system based on Linux that runs on switches from multiple vendors and ASICs. It offers a full suite of network functionality, including BGP and RDMA, that has been production-hardened in the data centers of some of the largest cloud service providers (sonicfoundation.dev). This matters for private AI deployments because it means the same open NOS powering hyperscaler AI clusters is available to enterprise teams building smaller-scale inference infrastructure.

For Australian enterprises — particularly in financial services, healthcare, mining analytics, and government — data sovereignty and latency requirements often make private AI inference a compliance necessity, not just an optimization. Planning the network and storage stack correctly from day one avoids costly re-architecture six months later.

The AI Inference Data Path: What Your Network Must Carry

A private AI inference deployment has three distinct traffic flows that place different demands on the network:

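GPU-to-GPU (backend fabric): When a model spans multiple GPUs, as is common with 70B+ parameter LLMs, tensor parallelism requires extremely low-latency, lossless communication between GPUs. This traffic uses RDMA over Converged Ethernet v2 (RoCE v2) and is highly sensitive to congestion and packet loss. A single lost RoCE packet can cause a GPU timeout that cascades into inference latency spikes.

GPU-to-storage (storage path): Inference servers read model checkpoints, vector database indexes, and offloaded KV caches from local or shared NVMe storage. Model loading is a sustained sequential read where throughput matters most, while vector database lookups are small random reads where IOPS and latency matter more.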

Client-to-server (frontend path): Application traffic hitting the inference API endpoint is relatively bursty and latency-sensitive but low-bandwidth compared to the backend flows. This traffic typically rides the same data center fabric but should be isolated via VLANs or VRFs to prevent backend congestion from affecting API response times.

The practical takeaway: your network plan must treat the GPU backend fabric and the storage path as separate design problems with different performance targets. A single flat Layer 2 network is unlikely to serve all three flows well.
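One way to make the separation concrete is to classify flows by transport and destination port before mapping them to VLANs, VRFs, or QoS classes. The sketch below is illustrative only. UDP port 4791 is the registered RoCE v2 port; the storage and frontend port sets, the Flow type, and the classify function are assumptions made for this example and should be replaced with the ports your deployment actually uses.

```python
from dataclasses import dataclass

ROCE_V2_UDP_PORT = 4791  # registered UDP destination port for RoCE v2

# Assumed port sets for this sketch only; adjust to the real deployment.
STORAGE_TCP_PORTS = {4420, 2049}   # NVMe/TCP default port, NFS
FRONTEND_TCP_PORTS = {443, 8000}   # HTTPS and an example inference API port


@dataclass(frozen=True)
class Flow:
    protocol: str   # "udp" or "tcp"
    dst_port: int


def classify(flow: Flow) -> str:
    """Map a flow to backend, storage, frontend, or unclassified."""
    if flow.protocol == "udp" and flow.dst_port == ROCE_V2_UDP_PORT:
        return "backend"
    if flow.protocol == "tcp" and flow.dst_port in STORAGE_TCP_PORTS:
        return "storage"
    if flow.protocol == "tcp" and flow.dst_port in FRONTEND_TCP_PORTS:
        return "frontend"
    return "unclassified"


if __name__ == "__main__":
    for f in (Flow("udp", 4791), Flow("tcp", 4420), Flow("tcp", 443)):
        print(f, "->", classify(f))
```

Note that storage protocols running over RDMA would also arrive on UDP 4791, so a port-only classifier like this one cannot tell them apart from GPU backend traffic; a real design would also use interface, VLAN, or DSCP markings.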

Spine-Leaf Architecture with SONiC: Building the AI Fabric

The spine-leaf (Clos) topology has become the reference architecture for AI and data center fabrics, and for good reason. It provides predictable east-west latency, non-blocking bisection bandwidth, and horizontal scalability — all properties that GPU inference clusters need.

SONiC’s container-based architecture, where each network function runs in its own Docker container, supports this design pattern well. As the SONiC GitHub repository notes, this modular design provides better fault isolation, easier debugging and troubleshooting, simplified upgrades and maintenance, and enhanced scalability (github.com/sonic-net/SONiC). For an AI inference fabric where uptime during model serving hours is critical, the ability to upgrade or troubleshoot a single container without restarting the entire switch OS is a meaningful operational advantage.

NVIDIA’s Spectrum-X Ethernet platform illustrates the performance envelope available on SONiC-compatible hardware. The Spectrum-4 SN5000 series is described as purpose-built for AI, connecting cloud-scale GPU compute at speeds up to 800 Gb/s, with RoCE acceleration built into the switching ASIC (nvidia.com/en-us/networking/ethernet-switching). SONiC is listed as a supported NOS alongside Cumulus Linux on these platforms.

For a private AI inference deployment with 4 to 16 GPU servers, a typical spine-leaf design might look like:

Leaf switches: Each GPU server connects to a pair of leaf switches via 100GbE or 200GbE links for the backend fabric. Dual-homing provides redundancy.
Spine switches: Two to four spine switches provide non-blocking east-west connectivity between all leaf pairs.
Frontend network: A separate set of leaf switches (or VLANs on the same leaf pair) handles application and management traffic at 25GbE or 100GbE.
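Whether a leaf is actually non-blocking depends on the ratio of server-facing bandwidth to spine-facing bandwidth. The following is a minimal sketch of that arithmetic; the function name and the example port counts are assumptions for illustration, not a recommended bill of materials.

```python
def oversubscription_ratio(downlinks: int, downlink_gbps: float,
                           uplinks: int, uplink_gbps: float) -> float:
    """Return server-facing bandwidth divided by spine-facing bandwidth.

    A result of 1.0 means the leaf is non-blocking; values above 1.0
    mean the uplinks can be oversubscribed under full load.
    """
    uplink_total = uplinks * uplink_gbps
    if uplink_total <= 0:
        raise ValueError("a leaf needs at least one uplink with non-zero speed")
    return (downlinks * downlink_gbps) / uplink_total


if __name__ == "__main__":
    # Assumed example: 8 GPU-server ports at 200GbE, 4 uplinks at 400GbE.
    print(oversubscription_ratio(8, 200, 4, 400))   # 1.0, non-blocking
    # Assumed example: 16 ports at 100GbE, 2 uplinks at 400GbE.
    print(oversubscription_ratio(16, 100, 2, 400))  # 2.0, 2:1 oversubscribed
```

For the GPU backend fabric a ratio of 1.0 is the usual target, while a frontend network can often tolerate a higher ratio.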

Key SONiC features for this design include BGP EVPN for overlay routing, VXLAN for network virtualization, and RDMA-aware congestion management. The specific ASIC and switch platform selection should be driven by port count, speed requirements, and budget — not by NOS choice, since SONiC decouples the software from the hardware.

[xSONIC product alignment: Data Center AI Switches (/products/datacenter-ai/) for spine and leaf roles, Bare Metal Switches (/products/bare-metal/) for teams evaluating custom NOS deployments. Solution pillars: AI Fabric (/solutions/data-center/ai-fabric/), EVPN-VXLAN (/solutions/data-center/evpn-vxlan-guide/).]

RoCE v2 and Lossless Ethernet: The RDMA Prerequisite

GPU inference workloads that use tensor parallelism across multiple GPUs rely on RDMA (Remote Direct Memory Access) for inter-GPU communication. In Ethernet-based AI fabrics, this means RoCE v2 — RDMA over Converged Ethernet version 2.

RoCE v2 requires the network to deliver lossless or near-lossless behavior. Unlike TCP, which handles packet loss through retransmission, RDMA treats packet loss as a transport error that can stall communication between GPUs. To achieve lossless behavior, the network fabric must implement:

Priority Flow Control (PFC): Allows the switch to pause traffic on a per-priority basis rather than dropping packets when a queue fills up.
Data Center Bridging Capability Exchange (DCBX): Enables switches and NICs to negotiate QoS parameters automatically, ensuring consistent priority and flow control configuration across the fabric.
Explicit Congestion Notification (ECN) and Fast CNP (Congestion Notification Packets): Provide early congestion signals to RDMA senders, allowing them to reduce their transmission rate before packets are dropped.

These features are available in SONiC-based switch platforms and are critical for any AI inference fabric that uses multi-GPU model parallelism. Without them, RDMA traffic will experience timeouts and retries that manifest as inconsistent inference latency — a particularly visible problem in interactive LLM applications where users expect smooth token generation.
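To illustrate how ECN gives senders an early signal, the sketch below models a common threshold-based marking scheme: no marking below a minimum queue depth, a linearly increasing marking probability up to a maximum threshold, and marking of every packet beyond it. This is a simplified model under stated assumptions, not the behavior of any specific ASIC or SONiC release; the function name, parameter names, and example thresholds are hypothetical.

```python
def ecn_mark_probability(queue_kb: float, kmin_kb: float,
                         kmax_kb: float, pmax: float) -> float:
    """Probability that a packet is ECN-marked at a given queue depth.

    Below kmin nothing is marked, between kmin and kmax the probability
    rises linearly to pmax, and above kmax every packet is marked.
    """
    if not 0.0 <= pmax <= 1.0:
        raise ValueError("pmax must be between 0 and 1")
    if kmin_kb >= kmax_kb:
        raise ValueError("kmin must be smaller than kmax")
    if queue_kb <= kmin_kb:
        return 0.0
    if queue_kb > kmax_kb:
        return 1.0
    return pmax * (queue_kb - kmin_kb) / (kmax_kb - kmin_kb)


if __name__ == "__main__":
    # Assumed thresholds for illustration only.
    for depth in (50, 250, 400, 600):
        print(depth, "KB ->", ecn_mark_probability(depth, 100, 400, 0.2))
```

The design intent is that marking starts well before the queue depth at which PFC would pause the link, so senders slow down first and pause frames remain a last resort.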

For Australian enterprises deploying private AI inference, RoCE v2 over a SONiC fabric offers a practical middle ground: the performance characteristics needed for GPU backend communication without the vendor lock-in and premium pricing of proprietary InfiniBand alternatives.

[xSONIC solution alignment: RoCE v2 (/solutions/data-center/roce-v2-guide/), DCBX (/solutions/data-center/dcbx-technology/), Fast CNP (/solutions/data-center/fast-cnp/), GPU Backend Fabric (/solutions/data-center/gpu-backend-fabric/).]

Storage Planning: NVMe SSDs and the Inference I/O Pattern

Private AI inference has specific storage demands that differ from both traditional enterprise applications and AI training workloads:

Model loading: When a new model is loaded onto a GPU server (or swapped between models), the entire checkpoint, often 13 GB to 140 GB depending on parameter count and quantization, must be read from storage and transferred to GPU memory. This is a sustained sequential read operation where NVMe SSD throughput directly determines cold-start time.

RAG vector database access: Retrieval-augmented generation workloads query vector databases stored on local or network-attached NVMe storage. These are random-read-intensive operations with small I/O sizes, making IOPS and latency more important than sequential throughput.

KV cache offloading: Some inference frameworks support offloading key-value (KV) caches to local NVMe storage when GPU memory is constrained. This is a mixed read/write workload with tight latency requirements.
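The model loading figures above can be approximated with simple arithmetic: checkpoint size is roughly parameter count times bytes per parameter, and the lower bound on cold-start time is size divided by sustained read throughput. The sketch below shows this under assumptions: it counts weights only, ignores file-format and framework overhead, and the 7 GB/s read rate is a hypothetical figure rather than a measured one.

```python
def checkpoint_size_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate weight size in GB (10^9 bytes), weights only."""
    return params_billion * bits_per_param / 8


def cold_start_seconds(size_gb: float, read_gb_per_s: float) -> float:
    """Lower bound on load time from storage at a sustained read rate."""
    if read_gb_per_s <= 0:
        raise ValueError("read throughput must be positive")
    return size_gb / read_gb_per_s


if __name__ == "__main__":
    fp16_70b = checkpoint_size_gb(70, 16)   # 140.0 GB
    int4_70b = checkpoint_size_gb(70, 4)    # 35.0 GB
    # Assumed sustained sequential read rate of 7 GB/s for illustration.
    print(fp16_70b, cold_start_seconds(fp16_70b, 7.0))  # 140.0 20.0
    print(int4_70b, cold_start_seconds(int4_70b, 7.0))  # 35.0 5.0
```

Real cold-start times are longer because of deserialization and the host-to-GPU copy, so treat the result as a floor for capacity planning.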

Enterprise PCIe Gen4 NVMe SSDs in U.2, M.2, E1.S, and AIC (Add-In Card) form factors address these workloads. For an inference server with 4 to 8 GPU slots, a typical local storage configuration might include:

2-4x U.2 NVMe SSDs in RAID 1 or RAID 10 for model storage and checkpoint loading
1-2x M.2 NVMe SSDs for the OS and inference framework installation
Optional: E1.S SSDs for high-density deployments in compact server chassis
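Mirroring halves raw capacity, so it is worth checking that the model library still fits. The following is a minimal sketch, assuming RAID 1 is a two-drive mirror and RAID 10 is striped mirror pairs; the function names, drive size, and checkpoint sizes are hypothetical values chosen for illustration.

```python
from typing import Sequence


def usable_capacity_tb(drives: int, drive_tb: float, level: str) -> float:
    """Usable capacity for the two mirrored layouts mentioned above."""
    if level == "raid1":
        if drives != 2:
            raise ValueError("this sketch models RAID 1 as a two-drive mirror")
        return drive_tb
    if level == "raid10":
        if drives < 4 or drives % 2:
            raise ValueError("RAID 10 needs an even number of drives, at least four")
        return (drives // 2) * drive_tb
    raise ValueError(f"unsupported RAID level: {level}")


def fits(checkpoints_gb: Sequence[float], usable_tb: float) -> bool:
    """True if the listed checkpoints fit in the usable capacity."""
    return sum(checkpoints_gb) / 1000 <= usable_tb


if __name__ == "__main__":
    capacity = usable_capacity_tb(4, 3.84, "raid10")  # assumed 3.84 TB drives
    print(capacity, fits([140, 35, 14], capacity))
```

This ignores filesystem overhead and free-space headroom, both of which reduce the practical figure.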

The storage network — connecting inference servers to shared model repositories or distributed vector databases — should be planned separately from the GPU backend fabric. A 100GbE storage network with jumbo frames (9000 MTU) is a common baseline for sustained throughput to shared NVMe storage arrays.
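The benefit of jumbo frames can be estimated from per-frame overhead. The sketch below assumes untagged Ethernet (38 bytes of preamble, header, frame check sequence, and inter-frame gap per frame) carrying IPv4 and TCP without options (40 bytes of headers inside the MTU). Storage protocols add their own headers on top, so these are upper bounds; the function names are illustrative.

```python
ETHERNET_OVERHEAD_BYTES = 38  # preamble+SFD 8, header 14, FCS 4, inter-frame gap 12
IP_TCP_HEADER_BYTES = 40      # IPv4 20 + TCP 20, no options


def payload_efficiency(mtu: int) -> float:
    """Fraction of wire bandwidth left for TCP payload at a given MTU."""
    if mtu <= IP_TCP_HEADER_BYTES:
        raise ValueError("MTU must be larger than the IP and TCP headers")
    return (mtu - IP_TCP_HEADER_BYTES) / (mtu + ETHERNET_OVERHEAD_BYTES)


def goodput_gbps(link_gbps: float, mtu: int) -> float:
    """Upper bound on TCP payload throughput for a link speed and MTU."""
    return link_gbps * payload_efficiency(mtu)


if __name__ == "__main__":
    for mtu in (1500, 9000):
        print(mtu, round(goodput_gbps(100, mtu), 1), "Gb/s")
```

Under these assumptions a 100GbE link tops out at roughly 94.9 Gb/s of payload with a 1500-byte MTU and roughly 99.1 Gb/s with a 9000-byte MTU. The larger frames also cut per-packet processing on hosts, which is often the more significant gain.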

Optical Transceivers: Connecting the AI Inference Fabric

The physical layer of an AI inference fabric relies on optical transceivers and direct-attach cables (DACs) to connect switches to servers and switches to each other. The transceiver selection depends on link distance, speed, and port density:

Link Role	Typical Distance	Speed	Transceiver or Cable Type
Leaf to GPU server	1-5 m (same rack)	100GbE / 200GbE	DAC or AOC
Leaf to spine	10-100 m (same row or room)	100GbE / 400GbE	QSFP28 / QSFP-DD SR4
Spine to spine (if needed)	10-150 m	400GbE / 800GbE	QSFP-DD / OSFP SR8
Frontend uplinks	10-300 m	25GbE / 100GbE	SFP28 / QSFP28 LR

For short-reach connections within the AI fabric — leaf to server and leaf to spine within the same row or room — SR (short reach) optics and DAC/AOC cables provide the lowest cost and lowest latency. For connections between buildings or across a campus, LR (long reach) or ER (extended reach) optics on single-mode fiber are necessary.
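The distance-based choice described above can be expressed as a simple lookup. The reach limits in this sketch are rough, assumed planning values (they vary with speed, cable gauge, and fiber grade) and the function name is hypothetical; always confirm against the datasheet for the specific optic or cable.

```python
# (maximum reach in metres, media) in ascending order.
# Assumed planning values for illustration, not vendor specifications.
REACH_LIMITS_M = [
    (3, "DAC"),
    (30, "AOC"),
    (100, "SR optics on multimode fiber"),
    (10_000, "LR optics on single-mode fiber"),
    (40_000, "ER optics on single-mode fiber"),
]


def pick_media(distance_m: float) -> str:
    """Return the first media type whose assumed reach covers the link."""
    if distance_m <= 0:
        raise ValueError("distance must be positive")
    for limit_m, media in REACH_LIMITS_M:
        if distance_m <= limit_m:
            return media
    raise ValueError("link is longer than the reaches modelled in this sketch")


if __name__ == "__main__":
    for d in (2, 15, 80, 2_000):
        print(d, "m ->", pick_media(d))
```

Ordering the table from shortest to longest reach means the function returns the lowest-cost option that covers the distance, which mirrors the selection logic in the text.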

Optical transceiver compatibility with SONiC-based switches depends on the switch platform and ASIC vendor. Most SONiC-compatible switches accept multi-source agreement (MSA) compliant transceivers, which gives buyers the flexibility to source optics from multiple vendors rather than being locked to the switch OEM’s optics.

[xSONIC product alignment: Optical Transceivers (/products/optical-transceiver/) covering SFP28, QSFP28, QSFP-DD, and OSFP form factors.]

Visibility and Observability: Why Packet Brokers Matter for AI Inference

Once an AI inference fabric is in production, network visibility becomes essential for troubleshooting latency issues, detecting congestion, and validating QoS configuration. Traditional approaches — mirroring switch ports to a monitoring tool — work at small scale but become impractical as the fabric grows.

Network packet brokers provide dedicated hardware for traffic aggregation, filtering, replication, and load balancing across monitoring and security tools. In an AI inference context, packet brokers can:

Filter and replicate RoCE v2 traffic for RDMA performance analysis
Aggregate traffic from multiple leaf switches into a centralized monitoring platform
Deduplicate mirrored packets to reduce tool-side processing load
Deliver traffic to security inspection tools without impacting the production fabric
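Deduplication is the least obvious of these functions, so a small sketch may help. When the same packet is mirrored at several points in the fabric, the monitoring tool receives multiple copies within a short interval. The example below drops any packet whose content hash was seen within a time window. It assumes packets arrive in timestamp order and hashes the full payload; the function name and the 5 ms window are assumptions, and hardware packet brokers typically hash selected header fields instead so that fields changed in transit do not defeat the match.

```python
import hashlib
from typing import Dict, Iterable, Iterator, Tuple

Packet = Tuple[float, bytes]  # (capture timestamp in seconds, raw bytes)


def deduplicate(packets: Iterable[Packet],
                window_s: float = 0.005) -> Iterator[Packet]:
    """Yield packets, skipping copies seen again within window_s seconds."""
    last_seen: Dict[bytes, float] = {}
    for ts, payload in packets:
        digest = hashlib.sha256(payload).digest()
        previous = last_seen.get(digest)
        last_seen[digest] = ts  # remember the most recent sighting
        if previous is not None and ts - previous <= window_s:
            continue  # duplicate of a recently seen packet
        yield ts, payload


if __name__ == "__main__":
    mirrored = [(0.000, b"a"), (0.001, b"a"), (0.002, b"b"), (0.100, b"a")]
    print(list(deduplicate(mirrored)))
```

The dictionary in this sketch grows without bound; a production implementation would expire entries older than the window.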

SONiC-based telemetry features — including In-band Network Telemetry (INT) and IPTPath telemetry — complement physical packet brokers by providing hop-by-hop visibility into packet latency, queue depth, and congestion events directly from the switch ASIC. Together, INT telemetry and packet broker infrastructure give operations teams the visibility needed to maintain consistent inference performance.

[xSONIC product alignment: Packet Brokers (/products/packet-broker/) for physical visibility. Solution pillars: INT Telemetry (/solutions/data-center/int-technology/), IPTPath Telemetry (/solutions/data-center/iptpath-telemetry/).]

Sources Reviewed

SONiC Foundation: https://sonicfoundation.dev/
SONiC GitHub: https://github.com/sonic-net/SONiC
Azure SONiC Documentation: https://azure.github.io/SONiC
Open Compute Networking: https://www.opencompute.org/projects/networking
Broadcom Ethernet Switching: https://www.broadcom.com/products/ethernet-connectivity/switching
Marvell Switching: https://www.marvell.com/products/switching.html
NVIDIA Ethernet Switching: https://www.nvidia.com/en-us/networking/ethernet-switching
