
Leaf-Spine Data Center Design at 400G and 800G: What Australian Network Teams Need to Know

An original technical guide to leaf-spine architecture at 400G and 800G line rates, covering topology fundamentals, ASIC considerations, SONiC as the NOS layer, and a practical evaluation framework for Australian

By xSONiC Team · Tags: SONiC, open networking, data center, AI fabric, Ethernet, automation

Why Leaf-Spine Is the Dominant Data Centre Fabric

Traditional three-tier data centre networks (access, aggregation, core) relied on oversubscription at every hop. That model breaks down under the traffic patterns of modern distributed applications: east-west flows between compute nodes, GPU clusters exchanging RDMA traffic, and storage replication across racks.

Leaf-spine architecture replaces the hierarchy with a two-tier fabric that can be built non-blocking. Every leaf switch connects to every spine switch, and every server or device connects to a leaf. The result is a predictable path (traffic between endpoints on different leaves always crosses leaf, spine, leaf), consistent latency, and horizontal scalability: when you need more ports you add leaves, and when you need more bandwidth you add spines, rather than redesigning the network.

This design has become the default for hyperscale clouds and is increasingly standard for enterprise data centres, colocation deployments, and AI training clusters.

The Bandwidth Step-Up: 400G Today, 800G Now

The transition from 100G/200G to 400G spine links was driven by the need to absorb growing east-west traffic without multiplying rack count. At 400G, a single QSFP-DD or OSFP port can replace four 100G ports on the spine tier, reducing cabling, power draw per gigabit, and switch count.

800G raises the bar again. A single 800G OSFP port doubles the bandwidth of a 400G port, or it can be broken out into multiple 200G or 100G channels for leaf-to-server connectivity. This matters for AI and HPC fabrics, where GPU-to-GPU traffic demands low-latency, high-throughput interconnects and can saturate 400G links faster than expected.

Decision table for port speed selection:

| Factor | 400G Spine | 800G Spine |
| --- | --- | --- |
| Typical use | General enterprise and cloud leaf-spine | AI/HPC clusters, hyperscale cloud |
| Port density (per ASIC) | 32 ports (e.g. QSFP-DD) | 64 ports (e.g. OSFP) on latest silicon |
| Fan-out options | 4x 100G, 2x 200G | 8x 100G, 4x 200G, 2x 400G |
| Maturity | Production-proven at scale | Entering production; check vendor availability |
| ASIC generations supporting it | Spectrum-3 class and above | Spectrum-4 / Spectrum-6 class |

Note: Port counts and break-out options vary by switch model and ASIC vendor. The numbers above are representative of published specifications from at least one vendor family and should be confirmed against the specific platform you evaluate.
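
The fan-out row in the table follows from simple division of the port speed. The sketch below is illustrative only: the `CHILD_SPEEDS_G` tuple and the function name are assumptions made for this example, not part of any vendor tool, and real break-out support depends on the platform and must be checked per switch.

```python
# Candidate child speeds in Gb/s; an assumption chosen to match the table above.
CHILD_SPEEDS_G = (100, 200, 400)


def breakout_options(port_speed_g: int) -> list[str]:
    """List even splits of one port into lower-speed child ports."""
    options = []
    for child in CHILD_SPEEDS_G:
        # Keep only child speeds that are slower and divide the port evenly.
        if child < port_speed_g and port_speed_g % child == 0:
            options.append(f"{port_speed_g // child}x {child}G")
    return options


if __name__ == "__main__":
    for speed in (400, 800):
        print(f"{speed}G:", ", ".join(breakout_options(speed)))
```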

How Leaf-Spine Works at 400G and 800G

A leaf-spine fabric at these speeds follows the same logical pattern as a 100G deployment, but the physical design decisions change.

Leaf tier: Each leaf switch sits at the top of a rack (or half-rack) and provides server-facing ports at 25G, 50G, or 100G, plus uplink ports to every spine at 400G or 800G. The leaf ASIC must handle the aggregate bandwidth of all server ports plus any in-rack traffic.
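
One way to sanity-check a leaf design is its oversubscription ratio: server-facing bandwidth divided by uplink bandwidth. The following is a minimal sketch; the port counts in the example are hypothetical and not taken from any specific switch.

```python
def oversubscription_ratio(server_ports: int, server_speed_g: int,
                           uplinks: int, uplink_speed_g: int) -> float:
    """Return downlink bandwidth divided by uplink bandwidth for one leaf."""
    downlink_g = server_ports * server_speed_g
    uplink_g = uplinks * uplink_speed_g
    if uplink_g == 0:
        raise ValueError("a leaf needs at least one uplink")
    return downlink_g / uplink_g


if __name__ == "__main__":
    # Hypothetical leaf: 48 x 100G server ports and 8 x 400G uplinks.
    ratio = oversubscription_ratio(48, 100, 8, 400)
    print(f"{ratio:.2f}:1")  # 4800 / 3200 = 1.50:1
```

A ratio of 1.0 or lower means the uplinks can carry all server-facing traffic at line rate; AI fabrics usually aim for that, while general-purpose fabrics often accept a higher ratio.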

Spine tier: Spine switches have no server-facing ports. Every port faces a leaf. With one uplink per leaf, a 32-port 400G spine switch can serve 32 leaf switches, and a 64-port 800G spine switch can serve 64 leaves. The higher port count, not the port speed alone, is what doubles fabric scale without adding spine nodes.
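
The spine port count puts a ceiling on the size of a two-tier fabric. The sketch below works through that arithmetic under stated assumptions: every spine port faces a leaf, and each leaf uses the same number of ports on every spine. The function and parameter names are illustrative.

```python
def max_fabric_size(spine_ports: int, leaf_server_ports: int,
                    uplinks_per_spine: int = 1) -> dict[str, int]:
    """Upper bound on leaves and server ports in a two-tier fabric.

    Assumes every spine port faces a leaf and each leaf consumes
    `uplinks_per_spine` ports on every spine.
    """
    max_leaves = spine_ports // uplinks_per_spine
    return {
        "max_leaves": max_leaves,
        "max_server_ports": max_leaves * leaf_server_ports,
    }


if __name__ == "__main__":
    # Hypothetical leaves with 48 server-facing ports each.
    print(max_fabric_size(spine_ports=32, leaf_server_ports=48))
    print(max_fabric_size(spine_ports=64, leaf_server_ports=48))
```

With these example numbers, the 32-port spine tops out at 32 leaves and 1,536 server ports, and the 64-port spine at 64 leaves and 3,072 server ports.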

Superspine (optional): For very large fabrics (thousands of server ports), a third tier of superspine switches interconnects multiple leaf-spine pods. At 400G/800G, this is common in AI factory designs where GPU clusters span multiple rows or halls.

ECMP and load balancing: A leaf-spine fabric relies on Equal-Cost Multi-Path (ECMP) routing to spread flows across all spine uplinks. BGP is the most common routing protocol for data centre fabrics (OSPF and IS-IS are also used) because it scales well, offers fine-grained policy control, and is the protocol that SONiC implementations have production-hardened most extensively.
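
To make the ECMP idea concrete, the sketch below picks a spine by hashing a flow's 5-tuple. It is a simplified illustration: SHA-256 stands in for the vendor-specific hash functions that switch ASICs implement in hardware, and the addresses and spine names are hypothetical.

```python
import hashlib


def ecmp_next_hop(src_ip: str, dst_ip: str, proto: int,
                  src_port: int, dst_port: int, next_hops: list[str]) -> str:
    """Pick one next hop by hashing the flow's 5-tuple."""
    if not next_hops:
        raise ValueError("need at least one next hop")
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    # Reduce the first 32 bits of the digest to an index into the path list.
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]


if __name__ == "__main__":
    spines = ["spine1", "spine2", "spine3", "spine4"]  # hypothetical names
    # RoCE v2 runs over UDP (protocol 17) with destination port 4791.
    print(ecmp_next_hop("10.0.1.10", "10.0.2.20", 17, 49152, 4791, spines))
```

Because every packet of a flow hashes to the same path, packet order is preserved within the flow. The trade-off is that a small number of very large flows can land on the same uplink, which is one reason AI fabrics pay close attention to load-balancing behaviour.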

RDMA over Converged Ethernet (RoCE): For AI and HPC workloads, RoCE v2 traffic requires lossless or near-lossless fabric behaviour. This means Priority Flow Control (PFC, part of Data Center Bridging) and Explicit Congestion Notification (ECN) must be configured consistently end-to-end across the leaf-spine fabric. Not all switch ASICs handle RDMA workloads with the same efficiency.
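
ECN marking on a switch queue is commonly described as a WRED-style curve: no marking below a minimum threshold, a linear ramp up to a maximum probability, and marking of every packet above the maximum threshold. The sketch below models that curve. The threshold values are hypothetical and the parameter names are chosen for this example; actual settings depend on the ASIC, buffer size, and link speed.

```python
def ecn_mark_probability(queue_kb: float, kmin_kb: float,
                         kmax_kb: float, pmax: float) -> float:
    """WRED-style ECN marking probability for a given queue depth."""
    if not (0 <= kmin_kb < kmax_kb and 0.0 <= pmax <= 1.0):
        raise ValueError("need 0 <= kmin < kmax and 0 <= pmax <= 1")
    if queue_kb <= kmin_kb:
        return 0.0  # queue is shallow: never mark
    if queue_kb >= kmax_kb:
        return 1.0  # queue is deep: mark every packet
    # Linear ramp from 0 at kmin up to pmax at kmax.
    return pmax * (queue_kb - kmin_kb) / (kmax_kb - kmin_kb)


if __name__ == "__main__":
    # Hypothetical thresholds: Kmin 100 KB, Kmax 400 KB, Pmax 20%.
    for depth_kb in (50, 250, 500):
        p = ecn_mark_probability(depth_kb, 100, 400, 0.2)
        print(f"{depth_kb} KB -> {p:.2f}")
```

Senders that receive ECN feedback slow down before the queue overflows, which leaves PFC as a last-resort backstop rather than the primary congestion control.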

SONiC as the NOS Layer for Leaf-Spine Fabrics

Software for Open Networking in the Cloud (SONiC) is an open-source network operating system maintained under the Linux Foundation. It runs on switches from multiple hardware vendors and multiple ASIC families, which means network teams can choose their switching hardware independently from their software stack.

Key SONiC characteristics relevant to leaf-spine design:

  • Container-based architecture: Each network function (BGP, LLDP, DHCP relay, etc.) runs in its own Docker container. This allows teams to upgrade or troubleshoot individual services without affecting the entire switch.

  • Switch Abstraction Interface (SAI): SAI decouples the SONiC software from the underlying ASIC. Hardware vendors implement SAI for their silicon, and SONiC runs on top. This is what enables multi-vendor hardware choice.

  • BGP and RDMA support: SONiC offers production-hardened BGP (essential for leaf-spine routing) and RDMA support (essential for AI fabric traffic). These capabilities were developed and tested in the data centres of large cloud service providers.

  • JSON-based configuration: SONiC keeps its configuration in a central ConfigDB (a Redis database) that is loaded from and saved to a JSON file. This makes configuration programmable and suitable for automation pipelines using Ansible, Terraform, or custom tooling.

  • Active community and ecosystem: The SONiC project on GitHub has attracted contributions from major chip vendors and networking companies, and the supported devices list continues to grow.
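
As an illustration of the JSON-based configuration point, the sketch below generates a ConfigDB-style fragment that gives a leaf one BGP neighbour per spine. The table and field names (`DEVICE_METADATA`, `BGP_NEIGHBOR`, `bgp_asn`, `asn`, `name`) follow commonly published `config_db.json` examples, but the schema varies between SONiC releases, so treat them as assumptions to verify against your release. The hostnames, addresses, and AS numbers are hypothetical.

```python
import json


def leaf_bgp_fragment(hostname: str, local_asn: int,
                      spine_peers: list[tuple[str, int, str]]) -> dict:
    """Build a config_db-style fragment with one BGP neighbour per spine.

    spine_peers holds (neighbour_ip, neighbour_asn, neighbour_name) tuples.
    Field names are assumptions; check them against your SONiC release.
    """
    return {
        "DEVICE_METADATA": {
            "localhost": {"hostname": hostname, "bgp_asn": str(local_asn)}
        },
        "BGP_NEIGHBOR": {
            ip: {"asn": str(asn), "name": name}
            for ip, asn, name in spine_peers
        },
    }


if __name__ == "__main__":
    fragment = leaf_bgp_fragment(
        "leaf01", 65101,
        [("10.0.0.1", 65001, "spine01"), ("10.0.0.3", 65001, "spine02")],
    )
    print(json.dumps(fragment, indent=2))
```

Generating fragments like this from an inventory source is the kind of step an Ansible or custom pipeline would template for every leaf in the fabric.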

For Australian organisations, SONiC’s multi-vendor support is a practical advantage: it reduces vendor lock-in and can simplify procurement when local distributor stock varies across brands.

ASIC Considerations: What Drives 400G/800G Switch Selection

The switch ASIC is the heart of any leaf or spine switch. At 400G and 800G, the ASIC determines:

  • Maximum throughput per chip: Measured in Terabits per second (Tb/s). A spine switch at 800G needs an ASIC capable of at least 51.2 Tb/s to deliver 64 ports at full line rate.
  • Packet buffer size: Deep buffers help absorb microbursts, which are common in AI training traffic patterns. Insufficient buffer depth leads to packet drops and retransmissions.
  • Flow table scale: Large data centres need hundreds of thousands of routes, ACLs, and flow counters. ASICs with limited table sizes force architectural compromises.
  • RoCE support in hardware: Line-rate support for PFC and ECN marking on RoCE v2 traffic keeps congestion handling in the data plane and helps keep latency consistent.
  • Power consumption per port: At 800G, power per port becomes a real operational consideration, especially in Australian colocation facilities where power costs are significant.
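
The throughput figure in the first bullet is the sum of all port speeds. A minimal sketch of that arithmetic, with an illustrative function name:

```python
def required_throughput_tbps(ports: int, port_speed_g: int) -> float:
    """Sum of all port speeds, converted from Gb/s to Tb/s."""
    return ports * port_speed_g / 1000


if __name__ == "__main__":
    print(required_throughput_tbps(64, 800))  # 64 x 800G = 51.2 Tb/s
    print(required_throughput_tbps(32, 400))  # 32 x 400G = 12.8 Tb/s
```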

Published specifications from one major vendor show the following ASIC-to-product mapping (representative, not exhaustive):

| ASIC Generation | Max Port Speed | Example Switch | Max Throughput | Typical Role |
| --- | --- | --- | --- | --- |
| Spectrum-2 class | 200 Gb/s | SN3000 series | Up to 6.4 Tb/s | Leaf (general purpose) |
| Spectrum-3 class | 400 Gb/s | SN4000 series | Up to 12.8 Tb/s | Leaf or small spine |
| Spectrum-4 class | 800 Gb/s | SN5000 series | Up to 51.2 Tb/s | Spine (AI and cloud) |
| Spectrum-6 class | 800 Gb/s | SN6000 series | Up to 409.6 Tb/s (multi-chip) | Superspine / AI factory |

400G vs 800G: Choosing the Right Speed for Your Fabric

Not every data centre needs 800G spines today. The right choice depends on your workload, growth trajectory, and budget.

Choose 400G spine links when:

  • You are building or refreshing a general-purpose enterprise or cloud leaf-spine fabric
  • Server NICs are predominantly 25G or 50G
  • Your traffic growth is steady but not explosive
  • You want mature, widely available optics and switches

Choose 800G spine links when:

  • You are deploying GPU clusters for AI training or inference at scale
  • Server NICs are 100G or 200G (common with NVIDIA ConnectX or BlueField adapters)
  • You need to maximise port density at the spine tier to reduce device count
  • Your data centre power and cooling budget supports the higher per-port power envelope

Hybrid approach: Many teams deploy 400G spines today with a clear upgrade path to 800G. Because leaf-spine is inherently modular, you can swap spine switches for higher-speed models without redesigning the entire fabric, provided your leaf switches and cabling support the new speeds.

Practical Design Checklist for Australian Leaf-Spine Deployments

Use this checklist when planning a 400G or 800G leaf-spine fabric:

  1. Define your workload profile. General cloud, AI training, HPC, or mixed? This determines RDMA requirements, buffer depth needs, and port speed targets.

  2. Count your server ports. Total server count determines leaf switch count. Each leaf covers one rack (typically 24-48 servers at 25G/50G or 8-16 at 100G/200G).

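  3. Calculate spine count. Every leaf needs one uplink to every spine, so the number of uplinks per leaf sets the number of spines. If your leaf has 8 x 400G uplinks, you need 8 spines. If your leaf has 8 x 800G uplinks, the same logic applies. Also check that each spine has at least as many ports as you have leaves.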

  4. Verify ASIC table sizes. Confirm that your chosen switch platform supports enough routes, ACLs, and flow counters for your projected scale.

  5. Plan your optics and cabling budget. At 400G and 800G, optics can represent a significant portion of total fabric cost. Budget for both initial deployment and spares.

  6. Select your NOS. SONiC is a strong candidate for multi-vendor flexibility. Proprietary NOS options (Cumulus Linux, vendor-specific OS) may offer features or support tiers that suit your operational model. Evaluate based on your team’s Linux skills, automation requirements, and vendor support preferences.

  7. Design for power and cooling. Australian data centres vary widely in available power per rack. Confirm that your chosen switch form factor and optics power budget fit within your facility’s constraints.

  8. Test before you deploy. Use digital twin or network simulation tools (where available) to validate your fabric design, automation scripts, and failure scenarios before racking hardware.
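
Steps 2, 3, and 5 of the checklist can be roughed out in a few lines. The sketch below is a first-pass estimate under stated assumptions: one uplink from every leaf to every spine, and one pluggable optic at each end of every leaf-spine link (direct-attach cables would change the optics count). The function name and the example figures are hypothetical.

```python
import math


def size_fabric(servers: int, servers_per_leaf: int,
                uplinks_per_leaf: int, spine_ports: int) -> dict[str, int]:
    """Rough leaf, spine, link, and optics counts for a two-tier fabric."""
    leaves = math.ceil(servers / servers_per_leaf)
    spines = uplinks_per_leaf  # one uplink from each leaf to each spine
    if leaves > spine_ports:
        raise ValueError(
            f"{leaves} leaves exceed {spine_ports} spine ports; "
            "consider a superspine tier"
        )
    links = leaves * spines
    return {"leaves": leaves, "spines": spines,
            "links": links, "optics": 2 * links}


if __name__ == "__main__":
    # Hypothetical build: 1,000 servers, 32 per rack, 8 uplinks per leaf,
    # 32-port spines.
    print(size_fabric(1000, 32, 8, 32))
```

For the example figures this gives 32 leaves, 8 spines, 256 leaf-spine links, and 512 optics, which is a useful starting point for the optics and spares budget.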

Summary

Leaf-spine architecture at 400G and 800G is the foundation for modern data centre networking. 400G is production-ready and widely available. 800G is entering production and is the right choice for AI-scale fabrics. SONiC provides an open, multi-vendor NOS layer that runs on switches from multiple hardware vendors and ASIC families. The right speed and platform choice depends on your workload, scale, and operational requirements.
