XS-DC-64X800-AI-G1
Data Center AI
64-port 800G AI fabric switch for large-scale GPU clusters, HPC backbones, and ultra-high-throughput data center networks.
- 51.2 Tbps switching capacity
- 42,000 Mpps forwarding rate
Data Center Solution
Size the backend network around collective communication, not average utilization.
GPU backend networks carry the east-west traffic created by distributed AI training. These networks are different from general data center networks because application performance depends on synchronized communication phases such as all-reduce, parameter exchange, checkpointing, and storage access.
An xSONiC backend fabric should be designed around bandwidth symmetry, low tail latency, congestion behavior, and operational visibility.
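Before looking at specific traffic patterns, it helps to keep the basic bandwidth-symmetry arithmetic explicit. The short Python sketch below computes a leaf oversubscription ratio; the port counts and NIC speeds are illustrative assumptions, not a recommendation for any particular platform.

```python
# Minimal sizing sketch: leaf oversubscription ratio for a GPU backend leaf.
# Port counts and speeds below are illustrative assumptions.

def leaf_oversubscription(gpus_per_leaf: int, nic_gbps: int,
                          uplinks_per_leaf: int, uplink_gbps: int) -> float:
    """Ratio of server-facing bandwidth to uplink bandwidth on one leaf."""
    downlink_gbps = gpus_per_leaf * nic_gbps
    uplink_total_gbps = uplinks_per_leaf * uplink_gbps
    return downlink_gbps / uplink_total_gbps

# Example: 32 GPUs at 400G per leaf with 16 x 800G uplinks -> 1.00 (non-blocking).
ratio = leaf_oversubscription(gpus_per_leaf=32, nic_gbps=400,
                              uplinks_per_leaf=16, uplink_gbps=800)
print(f"leaf oversubscription: {ratio:.2f}:1")
```

A ratio above 1.00 means the leaf can admit more server traffic than it can forward upstream, which is exactly the condition that hurts synchronized collective phases.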
| Workload Pattern | Network Impact | Design Response |
|---|---|---|
| All-reduce | Many GPUs exchange data in synchronized phases. | Keep oversubscription low and ECMP behavior predictable (see the traffic sketch after this table). |
| Parameter exchange | Repeated east-west bursts. | Provide headroom and monitor queue pressure. |
| Checkpointing | Large periodic writes to storage. | Separate or carefully engineer storage paths. |
| Failure recovery | Traffic shifts after link or device failure. | Validate convergence and remaining bandwidth under failure. |
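To put a number on the all-reduce row, the sketch below estimates per-GPU traffic for a ring all-reduce. The buffer size and GPU count are illustrative assumptions, and frameworks may use tree or hierarchical variants that change the constant but not the conclusion.

```python
# Rough estimate of the data each GPU moves across the backend fabric
# during one ring all-reduce of a buffer of `payload_gb` gigabytes
# across `n_gpus` participants. Numbers below are illustrative.

def ring_allreduce_gb_per_gpu(payload_gb: float, n_gpus: int) -> float:
    """Per-GPU traffic in GB for a ring all-reduce: 2 * (N - 1) / N * payload."""
    return 2 * (n_gpus - 1) / n_gpus * payload_gb

# Example: a 10 GB gradient buffer across 1,024 GPUs moves roughly 20 GB
# per GPU in a single all-reduce step.
print(f"{ring_allreduce_gb_per_gpu(payload_gb=10, n_gpus=1024):.1f} GB per GPU")
```

Because every participant moves this volume in the same synchronized window, peak collective traffic, not average utilization, drives the sizing. The next table maps these traffic classes onto fabric roles and platform tiers.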
| Role | Function | xSONiC Platform Fit |
|---|---|---|
| Backend leaf | Connects GPU servers or accelerator nodes. | 400G/800G ports for high-density server attachment. |
| Backend spine | Provides non-blocking east-west capacity. | 400G/800G spine platforms with high radix. |
| Storage leaf | Connects high-performance storage targets. | 100G/200G/400G depending on storage tier. |
| Frontend boundary | Connects management, user, or service networks. | 100G/200G platforms for controlled separation. |
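As a rough feel for how the leaf and spine roles scale, the following sketch works out the two-tier port math, assuming a strictly non-blocking design with one uplink from each leaf to every spine; real deployments may bundle uplinks or split large clusters into pods.

```python
# Sketch of two-tier leaf/spine port math. Switch radix values are
# hypothetical; substitute the real platform port counts.

def fabric_size(n_gpus: int, ports_per_leaf: int, ports_per_spine: int):
    """Non-blocking two-tier leaf/spine: half of each leaf's ports face GPUs,
    half face spines, with one uplink from each leaf to every spine."""
    gpus_per_leaf = ports_per_leaf // 2
    uplinks_per_leaf = ports_per_leaf - gpus_per_leaf
    leaves = -(-n_gpus // gpus_per_leaf)   # ceiling division
    spines = uplinks_per_leaf              # one spine per leaf uplink
    if leaves > ports_per_spine:
        raise ValueError("cluster exceeds a single two-tier pod; "
                         "add a tier or split into pods")
    return leaves, spines

# Example: 2,048 GPUs on 64-port leaves and spines -> 64 leaves, 32 spines.
print(fabric_size(n_gpus=2048, ports_per_leaf=64, ports_per_spine=64))
```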
```
 GPU servers          GPU servers          GPU servers
      |                    |                    |
      v                    v                    v
Backend leaves ------ Backend spines ------ Backend leaves
      |                    |                    |
      +------ storage / checkpoint fabric boundary ------+
```
For large clusters, separate backend, frontend, and storage networks can reduce operational risk. Smaller clusters may converge some roles, but the traffic classes and failure domains should still be designed explicitly.
| Control | Purpose | Validation |
|---|---|---|
| PFC | Protect selected RDMA priorities from loss. | Confirm pause is limited to intended priorities (see the sketch after this table). |
| ECN | Mark congestion before queues overflow. | Verify sender response under incast and all-reduce. |
| ETS | Allocate bandwidth across traffic classes. | Confirm storage or management traffic does not starve backend traffic. |
| DCBX | Exchange DCB parameters with adjacent devices. | Check negotiated state on server-facing links. |
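The PFC check in particular is easy to automate. The sketch below is hypothetical: the per-port PFC data structure is an assumption standing in for whatever your tooling extracts from the switches, and priorities 3 and 4 are only example RDMA classes.

```python
# Illustrative check (not an actual SONiC schema) that pause is limited
# to the intended RDMA priorities on every server-facing port.

INTENDED_LOSSLESS = {3, 4}          # assumed RDMA traffic classes

# Hypothetical per-port view of PFC-enabled priorities, collected by your
# own tooling from the switches.
port_pfc = {
    "Ethernet0":  {3, 4},
    "Ethernet8":  {3, 4},
    "Ethernet16": {0, 3, 4},        # priority 0 should not be lossless
}

for port, lossless in sorted(port_pfc.items()):
    extra = lossless - INTENDED_LOSSLESS
    missing = INTENDED_LOSSLESS - lossless
    if extra or missing:
        print(f"{port}: unexpected={sorted(extra)} missing={sorted(missing)}")
    else:
        print(f"{port}: ok")
```

Once the controls are in place, the tests below confirm that they hold up under realistic load.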
| Test | What It Proves |
|---|---|
| All-reduce stress | Backend fabric can handle synchronized GPU communication. |
| Incast test | Queue and congestion controls behave under fan-in. |
| Link failure | Remaining paths can absorb traffic without severe job impact. |
| Storage checkpoint | Storage traffic does not destabilize backend communication. |
| Telemetry correlation | Operators can connect application slowdown to network state. |
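For the telemetry-correlation test, even a simple join between job step times and switch queue samples is enough to show whether operators can tie a slowdown to network state. Everything in the sketch below (timestamps, field layout, thresholds) is an illustrative assumption; adapt it to your telemetry pipeline.

```python
# Sketch of the telemetry-correlation test: line up slow training steps
# with queue-depth samples from the backend switches. Data and field
# names are assumptions standing in for your own telemetry sources.

step_times = [                       # (unix_ts, step_duration_s) from the job
    (1000, 0.81), (1030, 0.83), (1060, 1.72), (1090, 0.82),
]
queue_depth = [                      # (unix_ts, port, depth_bytes) from switches
    (1001, "Ethernet32", 120_000), (1058, "Ethernet32", 9_600_000),
    (1061, "Ethernet40", 8_900_000), (1092, "Ethernet32", 90_000),
]

SLOW_STEP_S = 1.2                    # assumed "slow step" threshold
WINDOW_S = 10                        # look this far around each slow step

for ts, dur in step_times:
    if dur < SLOW_STEP_S:
        continue
    nearby = [(t, p, d) for t, p, d in queue_depth if abs(t - ts) <= WINDOW_S]
    print(f"slow step at t={ts} ({dur:.2f}s), nearby queue samples: {nearby}")
```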
Use 800G xSONiC platforms for high-radix AI backend fabrics, 400G platforms for spine or high-density leaf roles, and 100G/200G systems for frontend, storage, or staged migration layers. The exact mix depends on GPU generation, NIC speed, cluster size, and failure-domain design.
Related Products
Use these related platforms as a starting point for sizing, comparison, and follow-up discussion.
- 64-port 800G AI fabric switch for large-scale GPU clusters, HPC backbones, and ultra-high-throughput data center networks.
- 32-port 400G spine/core switch for high-capacity data center fabrics and AI-ready backbones.
- 64-port 200G leaf/spine switch for high-bandwidth storage, compute, and scale-out data center fabrics.
Continue comparing these platforms against your requirements, or open a conversation if you need help mapping the solution to your environment.