GPU Backend Fabric Design Guide for AI Compute Network

Preface

AI clusters involve three types of networks: Frontend Fabric, GPU Backend Fabric, and Storage Backend Fabric.

  • Frontend Fabric: used to connect to the Internet or storage systems for loading training data.
  • GPU Backend Fabric (Compute Network): supports GPU-to-GPU communication, provides lossless connectivity, and enables cluster scaling. It is the core carrier for training data exchange between GPU nodes.
  • Storage Backend Fabric: handles massive data storage, retrieval, and management between GPUs and high-performance storage.

This guide focuses on the design of 400G GPU backend networks for AI compute at different scales. Using Asterfusion high-density 400G/800G data center switches as the hardware carrier, the solution implements Clos topologies based on Rail-only and Rail-optimized architectures and provides standardized deployment guidance.

Target Audience

Intended for solution planners, designers, and on-site implementation engineers who are familiar with:

  • Asterfusion data center switches
  • RoCE, PFC, ECN, and related technologies

1. Overview

The rapid evolution of AI/ML (Artificial Intelligence/Machine Learning) applications has driven a continuous surge in demand for large-scale clusters. AI training is a network-intensive workload in which GPU nodes exchange massive volumes of gradient data and model parameters at high frequency. This drives the need for network infrastructure defined by high bandwidth, low latency, and interference resistance.

Traditional general-purpose data center networks struggle to adapt to the traffic characteristics of AI training, which are dominated by “elephant flows” and low entropy. This often leads to bandwidth bottlenecks, transmission congestion, and latency jitter, failing to meet the rigorous requirements of AI training. As the “communication backbone” of the AI cluster, the backend network directly determines how fully GPU compute capacity can be utilized. Therefore, an efficient cluster networking solution is needed to deliver low-latency, high-throughput inter-node communication.

2. AI GPU Backend Network Architecture

2.1 Rail-Only Architecture

Leaf nodes connected to GPUs with the same index across different servers are defined as a Rail plane. That is, Rail N achieves interconnection for all #N GPUs via the N-th Leaf switch. As shown in the figure below, the GPUs on each server are numbered 0–7, corresponding to Rail 1–Rail 8. Intra-rail transmission occurs when the source and destination GPUs’ corresponding NICs are connected to the same Leaf switch. LLM (Large Language Model) training optimizes traffic distribution through hybrid parallelism strategies (Data, Tensor, and Pipeline parallelism), concentrating most traffic within nodes and within the same rail.

The Rail-only architecture adopts a single-tier network design, physically partitioning the cluster network into 8 independent rails. Network communication between GPUs on different nodes stays within a rail, achieving “single-hop” connectivity.

Figure 1: Rail-only Architecture

Compared to traditional Clos architectures, the Rail-only design eliminates the Spine layer. By reducing network tiers, it saves on the number of switches and optical modules, thereby reducing hardware costs. It is a cost-effective, high-performance architecture tailored for AI large model training in small-scale compute clusters.
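The rail-numbering rule described above (“Rail N interconnects all #N GPUs via the N-th Leaf switch”) can be expressed as a minimal sketch; the helper names here are illustrative, not part of any product API:

```python
def rail_of(gpu_index: int) -> int:
    """Map a GPU index (0-7) to its rail / Leaf switch number (1-8).

    In a Rail-only fabric, GPU #N of every server attaches to Leaf #(N+1),
    so all same-indexed GPUs across servers share one single-hop rail plane.
    """
    return gpu_index + 1

def same_rail(src_gpu: int, dst_gpu: int) -> bool:
    """True when traffic between two GPUs stays on one Leaf switch."""
    return rail_of(src_gpu) == rail_of(dst_gpu)

# GPU 3 on server A and GPU 3 on server B both sit on Rail 4:
assert rail_of(3) == 4
assert same_rail(3, 3) and not same_rail(0, 7)
```

Traffic between differently indexed GPUs first crosses the intra-server NVSwitch to reach the NIC on the destination's rail, then takes the single network hop.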

2.2 Rail-Optimized Architecture

Building on the Rail concept, a basic building block consisting of a set of Rails is regarded as a Group, which includes several Leaf switches and GPU servers. As the cluster scale increases, expansion is achieved by horizontally stacking multiple Groups.

The compute network can be visualized as a railway system: compute nodes are “stations” loaded with computing power; Rails are “exclusive rail lines” connecting the same-numbered GPUs at each station to ensure high-speed direct access; and Groups are “standard platform” units integrating multiple tracks and their supporting switches. Through this modular stacking, an intelligent computing center can scale horizontally like building blocks, ensuring both ultra-fast intra-rail communication and efficient interconnection for 10,000-GPU clusters.

Figure 2: Rail-optimized Architecture

As shown above, the key design of the Rail-optimized architecture is to connect the same-indexed NICs of every server to the same Leaf switch, ensuring that multi-node GPU communication completes in the fewest possible hops. In this design, communication between GPU nodes can utilize internal NVSwitch[1] paths, requiring only one network hop to reach the destination without crossing multiple switches, thus avoiding additional latency. The details are as follows:

  1. Intra-server: 8 GPUs connect to the NVSwitch via the NVLink bus, achieving low-latency intra-server communication and reducing Scale-Out network transmission pressure.
  2. Server-to-Leaf: All servers follow a uniform cabling rule: NICs are connected to multiple Leaf switches according to the “NIC1-Leaf1, NIC2-Leaf2…” pattern.
  3. Network Layer: Leaf and Spine switches are fully meshed in a 2-tier Clos architecture.

A key design factor in multi-stage Clos architectures is the Oversubscription Ratio. This is the ratio of total downlink bandwidth (Leaf nodes to GPU servers) to total uplink bandwidth (Leaf nodes to Spine nodes), as shown below. If the ratio is greater than 1:1, the fabric may lack sufficient capacity to handle inter-GPU traffic when downlink traffic reaches line rate, potentially causing congestion or packet loss.

Figure 3: Oversubscription Ratio in Rail-optimized Architecture

In short, a smaller oversubscription ratio leads to non-blocking communication but higher costs, while a larger ratio reduces costs but increases congestion risk. In high-performance AI networks, a 1:1 non-blocking design is generally recommended.
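The ratio defined above reduces to simple port arithmetic when all ports run at the same speed. The following sketch (illustrative helper, not a vendor tool) makes the trade-off concrete:

```python
def oversubscription_ratio(downlink_ports: int, uplink_ports: int,
                           port_speed_gbps: int = 400) -> float:
    """Ratio of total downlink bandwidth (Leaf -> GPU servers) to total
    uplink bandwidth (Leaf -> Spine). 1.0 is a non-blocking 1:1 design."""
    return (downlink_ports * port_speed_gbps) / (uplink_ports * port_speed_gbps)

# A 32-port Leaf with 16 ports down and 16 up is non-blocking (1:1):
assert oversubscription_ratio(16, 16) == 1.0
# 24 ports down and 8 up gives 3:1 -- cheaper, but congestion-prone:
assert oversubscription_ratio(24, 8) == 3.0
```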

2.3 Traffic Path Analysis

The intra-server and intra-rail communication paths are similar for both architectures. Taking the Rail-optimized architecture as an example, the following analyzes inter-GPU communication paths in different scenarios:

  • Intra-server Communication

Intra-server communication is completed via NVSwitch, without passing through the external network.

Figure 4: Intra-server Communication
  • Intra-rail Communication

Intra-rail communication is forwarded through a single Leaf switch.

Figure 5: Intra-rail Communication
  • Inter-rail (without PXN) and Cross-group Communication

Inter-rail communication is routed through the Spine layer. Similarly, inter-group communication traverses the Spine fabric to reach its destination.

Figure 6: Inter-rail (without PXN) and Inter-group Communication
  • Inter-rail (with PXN) Communication

With PXN[2] technology, transmission is completed in a single hop without crossing the Spine.

Figure 7: Inter-rail (with PXN) Communication

3. Technologies Supporting Lossless Networking

3.1 DCQCN Technology

RDMA (Remote Direct Memory Access) is widely used in HPC, AI training, and storage. Originally implemented on InfiniBand, it evolved into iWARP and RoCE (RDMA over Converged Ethernet) for Ethernet transport.

RoCEv2 utilizes UDP for transport, which necessitates end-to-end congestion control via PFC (Priority Flow Control) and ECN (Explicit Congestion Notification) to guarantee lossless performance. A PFC-only strategy risks unnecessary head-of-line blocking by halting traffic too aggressively, while a standalone ECN approach may suffer from reaction-time latency, potentially leading to buffer overflows and packet loss. Consequently, a unified congestion control strategy is required to balance responsiveness with stability.

DCQCN (Data Center Quantized Congestion Notification) is a hybrid congestion control algorithm designed to balance throughput and latency. It triggers ECN during early congestion to proactively throttle the NIC’s transmission rate. Should congestion intensify, PFC acts as a fail-safe, exerting backpressure hop-by-hop to prevent buffer overflows.

The DCQCN operational logic follows a structured hierarchy:

  1. ECN First (Proactive Intervention): As egress queues begin to accumulate and breach WRED thresholds, the switch marks packets (CE bits). Upon receiving these marked packets, the destination node generates CNPs (Congestion Notification Packets) directed back to the sender, which then smoothly scales down its injection rate to alleviate pressure without halting traffic.
  2. PFC Second (Reactive Safeguard): If congestion persists and buffer occupancy hits the xOFF threshold, the switch issues a PAUSE frame upstream. This temporarily halts transmission for the affected queue, ensuring zero packet loss.
  3. Flow Recovery: Once buffer levels recede below the xON threshold, a RESUME frame is sent to notify the upstream device to resume the transmission.
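The three-step hierarchy above can be sketched as a per-queue decision function. The threshold values here are purely illustrative; real deployments tune WRED/xOFF/xON per platform and buffer size:

```python
# Illustrative thresholds (queue depth in arbitrary buffer units) --
# NOT platform defaults; actual values are per-device tuning.
WRED_MIN = 60   # ECN marking begins
XOFF     = 90   # PFC PAUSE issued
XON      = 40   # PFC RESUME issued

def switch_action(queue_depth: int, paused: bool) -> str:
    """Sketch of the ECN-first / PFC-second decision order in DCQCN."""
    if paused:
        # Step 3, flow recovery: resume once the buffer drains below xON.
        return "RESUME" if queue_depth < XON else "PAUSED"
    if queue_depth >= XOFF:
        # Step 2, reactive safeguard: PAUSE the upstream queue (zero loss).
        return "PAUSE"
    if queue_depth >= WRED_MIN:
        # Step 1, proactive intervention: mark CE bits; the receiver
        # returns CNPs and the sender smoothly reduces its rate.
        return "MARK_ECN"
    return "FORWARD"

assert switch_action(50, paused=False) == "FORWARD"
assert switch_action(70, paused=False) == "MARK_ECN"
assert switch_action(95, paused=False) == "PAUSE"
assert switch_action(30, paused=True) == "RESUME"
```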

To streamline the complexities of lossless Ethernet, Asterfusion has introduced the Easy RoCE capability in AsterNOS. This feature automates optimized parameter generation and abstracts intricate configurations into business-level operations, significantly enhancing cluster maintainability.

3.2 Load Balancing Technology

ECMP (Equal-Cost Multi-Path) per-flow load balancing is the most widely used routing strategy in data center networks. It assigns packets to several paths by hashing fields, such as the IP 5-tuple. This approach is known as static load balancing.
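Per-flow path selection can be sketched as follows; CRC32 stands in here for whatever hash the switch ASIC actually implements, and the port numbers are illustrative (4791 is the standard RoCEv2 UDP destination port):

```python
import zlib

def ecmp_path(src_ip: str, dst_ip: str, proto: int,
              src_port: int, dst_port: int, n_paths: int) -> int:
    """Pick an equal-cost path by hashing the IP 5-tuple.

    Every packet of a flow hashes identically, so the whole flow is
    pinned to one member link (per-flow, static load balancing).
    """
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    return zlib.crc32(key) % n_paths

# The same flow always lands on the same uplink:
p1 = ecmp_path("10.0.0.1", "10.0.1.1", 17, 49152, 4791, 8)
p2 = ecmp_path("10.0.0.1", "10.0.1.1", 17, 49152, 4791, 8)
assert p1 == p2
```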

However, per-flow hashing struggles with uniform distribution when traffic lacks entropy. The impact is severe during “elephant flows”, which overwhelm specific member links and trigger packet loss.

AI workloads further challenge this model. Deep learning relies on collective communication (e.g., All-Reduce, All-Gather, and Broadcast) that generates massive, bursty traffic reaching Terabits per second (Tbps). These operations are subject to the “straggler effect” — where congestion on a single link bottlenecks the entire training job. This makes traditional ECMP unfit for RoCEv2-based AI backend fabrics.

To address this, the following solutions are introduced:

3.2.1 Adaptive Routing and Switching (ARS)

ARS is a flowlet-based load balancing technology. Leveraging hardware ALB (Auto-Load-Balancing)[3] capabilities, ARS achieves near per-packet equilibrium while mitigating packet reordering. The technology partitions a flow into a series of flowlets based on gap time. By sensing real-time link quality—such as bandwidth utilization and queue depth—ARS dynamically assigns flowlets to the most idle paths, maximizing overall fabric throughput.
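The flowlet-partitioning idea can be illustrated with a minimal sketch; the 50 µs gap threshold is an assumption for the example, not an ARS default:

```python
def split_into_flowlets(timestamps_us, gap_us=50):
    """Partition one flow's packet timestamps (microseconds) into flowlets.

    A new flowlet starts whenever the idle gap between two consecutive
    packets exceeds the threshold. Each flowlet can then be steered to
    the least-loaded path; because the gap exceeds path-latency skew,
    intra-flow packet reordering is avoided.
    """
    flowlets, current = [], [timestamps_us[0]]
    for prev, ts in zip(timestamps_us, timestamps_us[1:]):
        if ts - prev > gap_us:
            flowlets.append(current)
            current = []
        current.append(ts)
    flowlets.append(current)
    return flowlets

# Three bursts separated by >50 us idle gaps become three flowlets:
assert len(split_into_flowlets([0, 5, 10, 200, 205, 400], gap_us=50)) == 3
```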

3.2.2 Intelligent Routing

Intelligent routing provides both dynamic and static mechanisms.

  • Dynamic Intelligent Routing: This strategy evaluates path quality based on bandwidth usage, queue occupancy, and forwarding latency. Bandwidth and queue statistics are pulled from hardware registers with millisecond precision, while latency is monitored via INT (In-band Network Telemetry) with nanosecond resolution. Switches exchange this real-time telemetry via BGP extensions and utilize dynamic WCMP (Weighted Cost Multipath) to steer traffic toward the optimal path, proactively eliminating bottlenecks.
  • Static Intelligent Routing: Designed for scenarios requiring high path stability, this method uses PBR (Policy-Based Routing) to enforce deterministic forwarding. By binding specific GPU traffic to dedicated physical paths (Leaf-to-Spine), it ensures a strict 1:1 non-blocking oversubscription for fixed traffic models.

3.2.3 Packet Spraying

Packet Spraying[4] is a per-packet load balancing technique that distributes packets uniformly across all available member links to prevent single-path congestion. It supports two primary algorithms:

  • Random: Disperses packets across members using a randomized distribution.
  • Round Robin: Sequences packets across members in a cyclic, equal-weight manner.

While packet spraying theoretically maximizes network utilization, it introduces the challenge of packet reordering due to varying link latencies. Thus, this technology requires robust hardware support, specifically high-performance NICs capable of sophisticated out-of-order reassembly at the endpoint.
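The two spraying algorithms can be sketched as simple generators (illustrative only; real spraying happens per-packet in the ASIC data plane):

```python
import itertools
import random

def round_robin_sprayer(n_links: int):
    """Per-packet Round Robin: cycle packets over member links in order."""
    return itertools.cycle(range(n_links))

def random_sprayer(n_links: int):
    """Per-packet Random: pick a member link independently per packet."""
    while True:
        yield random.randrange(n_links)

# Eight packets over four links land perfectly evenly (two per link),
# but consecutive packets take different paths and may arrive reordered,
# which is why endpoint NICs must reassemble out-of-order data.
spray = round_robin_sprayer(4)
assert [next(spray) for _ in range(8)] == [0, 1, 2, 3, 0, 1, 2, 3]
```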

4. Building A 400Gbps GPU Backend Fabric for AI Compute Network

Based on hardware cost and scalability, the following design recommendations are provided:

Table 1: Solution Design by GPU Cluster Scale

| GPU Cluster Scale | Design Recommendation |
| --- | --- |
| 32–256 GPUs | CX732Q-N as Leaf nodes in a single-tier Clos Rail-only architecture, supporting up to 256 GPUs. |
| 256–1024 GPUs | CX864E-N as Leaf nodes in a single-tier Clos Rail-only architecture, supporting up to 1024 GPUs. |
| 1024–2048 GPUs | CX732Q-N as Leaf nodes and CX732Q-N or CX864E-N as Spine nodes in a 2-tier Clos Rail-optimized architecture, supporting up to 2048 GPUs. At least 2 Spine nodes are recommended for redundancy. |
| 2048–8192 GPUs | CX864E-N as both Leaf and Spine nodes in a 2-tier Clos Rail-optimized architecture, supporting up to 8192 GPUs. |

4.1 Small-Scale Cluster Design

4.1.1 Standardized Networking Solution

Figure 8: Standardized 400G AI Backend Network for Small-Scale Clusters

The figure above illustrates a Rail-only architecture for a 400G AI backend network consisting of 32 compute nodes (256 GPUs) with 8 CX732Q-N switches deployed as Leaf nodes. The key design principles are as follows:

  • Each GPU connects to a dedicated NIC; NICs follow the “NIC N to Leaf N” rule. Independent subnets per Rail.
  • Single-tier Clos architecture. 
  • Easy RoCE enabled on Leaf switches.

4.1.2 Hardware Selection

For small-scale 400Gbps RoCEv2 fabrics, Asterfusion CX864E-N or CX732Q-N switches are recommended. Taking the NVIDIA DGX H100 server (equipped with 8 GPUs) as a baseline, the maximum capacity for different models is summarized below:

Table 2: Max Capacity per Model (Rail-only Architecture)

| Model | Max GPUs per Switch | Max GPUs (8 Switches) | Max Servers (8 Switches) |
| --- | --- | --- | --- |
| CX732Q-N | 32 | 256 | 32 |
| CX864E-N | 128 | 1024 | 128 |

Note: CX864E-N provides 64 x 800G ports, which can be split into 128 x 400G ports.

Example: Building a 512-GPU Cluster.

To build a cluster with 64 H100 servers (512 GPUs) using CX864E-N as Leaf nodes:

  • Number of Leaf Nodes Required = 512 / 128 = 4 
  • Scalability Limit (Leafs) = 8 (matching the 8 GPUs per server)
  • Scalability Limit (GPUs) = 8 * 128 = 1024

Node Requirements and Scalability Summary:

  • Number of Leaf Nodes = Total GPUs / Max GPUs per switch.
  • Maximum Scalability (Leafs) = Number of GPUs per server. 
  • Maximum Scalability (Total GPUs) = GPUs per server * Max GPUs per switch.
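The sizing formulas above can be sketched in a few lines; the function name is illustrative, and the 512-GPU example from this section serves as the check:

```python
from math import ceil

def rail_only_leafs(total_gpus: int, gpus_per_server: int,
                    max_gpus_per_switch: int) -> int:
    """Leaf count for a single-tier Rail-only fabric.

    Scalability is capped at one Leaf per rail (= GPUs per server),
    i.e. gpus_per_server * max_gpus_per_switch GPUs in total.
    """
    max_total = gpus_per_server * max_gpus_per_switch
    assert total_gpus <= max_total, "exceeds Rail-only scalability limit"
    return ceil(total_gpus / max_gpus_per_switch)

# The 512-GPU example: 64 H100 servers on CX864E-N Leafs (128 GPUs each)
# needs 4 Leaf nodes, with headroom up to 8 Leafs / 1024 GPUs.
assert rail_only_leafs(512, 8, 128) == 4
assert rail_only_leafs(1024, 8, 128) == 8
```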

4.2 Medium-to-Large Scale Cluster Design

4.2.1 Standardized Networking Solution

Figure 9: Standardized 400G AI Backend Network for Medium-to-Large Clusters

The figure above depicts a Rail-optimized architecture for 128 compute nodes (1024 GPUs). It employs 24 CX864E-N switches (8 Spines, 16 Leafs) organized into two Groups, with 8 Leaf nodes per Group. Key design principles include:

  • Each GPU connects to a dedicated NIC; NICs follow the “NIC N to Leaf N” rule. Independent subnets per Rail.
  • 2-Tier Clos Fabric: Leaf and Spine switches are fully meshed. Leveraging IPv6 Link-Local, unnumbered BGP neighbors are established to exchange Rail subnet routes, eliminating the need for IP planning on interconnect interfaces.
  • 1:1 Oversubscription: To ensure non-blocking transport, the oversubscription ratio on Leaf switches is strictly maintained at 1:1.
  • Unified Lossless Fabric: Easy RoCE and advanced load balancing features are enabled on both Leaf and Spine nodes.

4.2.2 Hardware Selection

For these fabrics, we recommend CX864E-N and CX732Q-N due to their ultra-low latency. The CX864E-N offers end-to-end latency as low as 560 ns, while the CX732Q-N reaches 500 ns. This keeps intra-rail latency around 600 ns and inter-rail (3-hop) latency under 2 μs.

In a Rail-optimized design, the number of Leaf nodes per Group matches the number of GPUs per server (Rails). For H100 servers (8 GPUs), each Group contains 8 Leaf nodes. To maintain a 1:1 oversubscription, half of the Leaf’s ports connect to GPUs and half to Spines.

Table 3: Maximum Capacity per Group (Rail-optimized Architecture)

| Leaf Model | Available 400G Ports | Max GPUs / Servers per Group |
| --- | --- | --- |
| CX732Q-N | 32 | 128 / 16 |
| CX864E-N | 128 | 512 / 64 |

Spine Node Calculation: The number of Spine nodes is determined by the port density (radix) of the Leaf nodes. If Leaf and Spine switches provide M and N ports respectively, the required number of Spines = (Total Leafs * M / 2) / N. If Leaf and Spine use identical models, the Spine count = Total Leafs / 2.
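The Spine formula above translates directly into code (illustrative helper name; checked against the identical-model shortcut):

```python
from math import ceil

def spine_count(total_leafs: int, leaf_ports: int, spine_ports: int) -> int:
    """Spines for a 1:1 2-tier Clos: half of each Leaf's ports (M/2
    uplinks) must terminate across N-port Spine switches, so
    Spines = (Total Leafs * M / 2) / N."""
    return ceil(total_leafs * (leaf_ports // 2) / spine_ports)

# Identical Leaf/Spine models (M == N) reduce to Total Leafs / 2:
assert spine_count(64, 128, 128) == 32
# Mixed radix: 8 CX732Q-N Leafs (32 ports) on CX864E-N Spines (128 ports):
assert spine_count(8, 32, 128) == 1
```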

Example: Building a 4096-GPU Cluster

To build a cluster with 512 H100 servers (totaling 4096 GPUs) using CX864E-N for both Leaf and Spine layers, the calculation is as follows:

  • Leaf nodes per Group = 8
  • Max servers per Group = 128 / 2 = 64
  • Max GPUs per Group = 64 * 8 = 512
  • Number of Groups required = 4096 / 512 = 8
  • Total Leaf count = 8 (per Group) * 8 (Groups) = 64 Nodes
  • Total Spine count = 64 (Leafs) / 2 = 32 Nodes

Scalability Limits (CX864E-N as Spine/Leaf): When designing a compute network, scalability is limited by the Spine switch radix. For the CX864E-N (128 x 400G ports), the theoretical maximum scale is:

  • Max Groups Supported: 128 (Spine ports) / 8 (Leafs per Group) = 16.
  • Max Servers: 16 * 64 = 1024.
  • Max GPUs: 16 * 512 = 8192.

The following tables detail the node configuration requirements for deploying backend networks of varying GPU scales using the CX864E-N and CX732Q-N in Rail-optimized architecture:

Table 4: Node Requirements for CX864E-N

| Total GPUs / Servers | Leaf Nodes | Spine Nodes | 400G Links (per Leaf-Spine) |
| --- | --- | --- | --- |
| 256 / 32 | 4 | 2 | 32 |
| 512 / 64 | 8 | 4 | 16 |
| 1024 / 128 | 16 | 8 | 8 |
| 2048 / 256 | 32 | 16 | 4 |
| 4096 / 512 | 64 | 32 | 2 |
| 8192 / 1024 | 128 | 64 | 1 |

Table 5: Node Requirements for CX732Q-N

| Total GPUs / Servers | Leaf Nodes | Spine Nodes | 400G Links (per Leaf-Spine) |
| --- | --- | --- | --- |
| 128 / 16 | 8 | 4 | 4 |
| 256 / 32 | 16 | 8 | 2 |
| 512 / 64 | 32 | 16 | 1 |

Node Requirements Summary:

For a given cluster size, the required number of components is determined as follows:

  • Leaf Nodes per Group = Number of GPUs per server.
  • Max Servers per Group = Available Leaf ports / 2 (based on 1:1 oversubscription).
  • Max GPUs per Group = Max servers per Group * GPUs per server.
  • Total Number of Groups = Total target GPUs / Max GPUs per Group.
  • Total Leaf Count = Leaf nodes per Group * Total number of Groups.
  • Total Spine Count = (Total Leaf count * M / 2) / N, where M is the port count of the Leaf switch and N is the port count of the Spine switch.

Maximum Scalability Limits Summary:

The ultimate scale of a 2-tier Clos network is physically constrained by the Spine switch radix (port count):

  • Max Supportable Groups = Spine available ports / Leaf nodes per Group.
  • Max Supportable Servers = Max supportable Groups * Max servers per Group.
  • Max Supportable GPUs = Max supportable Groups * Max GPUs per Group.
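The node-requirement and scalability formulas above can be combined into one sizing sketch (illustrative function name; verified against the 4096-GPU worked example):

```python
from math import ceil

def plan_rail_optimized(total_gpus: int, gpus_per_server: int,
                        leaf_ports: int, spine_ports: int) -> dict:
    """Apply the Rail-optimized sizing formulas (1:1 oversubscription)."""
    leafs_per_group = gpus_per_server            # one Leaf per rail
    servers_per_group = leaf_ports // 2          # half the ports face GPUs
    gpus_per_group = servers_per_group * gpus_per_server
    groups = ceil(total_gpus / gpus_per_group)
    total_leafs = leafs_per_group * groups
    total_spines = ceil(total_leafs * (leaf_ports // 2) / spine_ports)
    max_groups = spine_ports // leafs_per_group  # Spine radix limit
    return {"groups": groups, "leafs": total_leafs, "spines": total_spines,
            "max_gpus": max_groups * gpus_per_group}

# The 4096-GPU example with CX864E-N (128 x 400G) at both tiers:
plan = plan_rail_optimized(4096, 8, 128, 128)
assert (plan["groups"], plan["leafs"], plan["spines"]) == (8, 64, 32)
assert plan["max_gpus"] == 8192   # Spine-radix ceiling for this model
```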

5. Conclusion

By leveraging Rail-only and Rail-optimized architectures, this solution minimizes communication hops between GPUs, significantly accelerating All-to-All collective performance and reducing overall training time. This design provides a robust and scalable framework for AI compute fabrics of any magnitude. For detailed deployment cases and configuration specifics, please refer to our Best Practices documentation.

[1] NVSwitch: A high-speed switching chip by NVIDIA designed for Scale-Up fabrics. It enables multi-GPU communication at maximum NVLink speeds within a single node.
[2] PXN (PCIe x NVLink): A pivotal NCCL technology that allows a GPU to aggregate data via NVLink to a peer GPU directly connected to a NIC. This data is then dispatched via PCIe, significantly enhancing the efficiency of cross-node collective communication.
[3] Supported by CX864E-N.
[4] Supported by CX864E-N.