AI Training Network
The Ultra Ethernet solution, built on the CX864E switch with super-large-scale support (256K GPUs), ultra-low latency (500 ns), and ultra-high network utilization (97%), shortens job completion time (JCT) by 3-5% compared with InfiniBand (IB) networks.
- 256K GPU cluster scale
- 500 ns extremely low latency
- 97% network utilization
- 3-5% lower JCT than IB
The Value of AI Training Network

AI Training Network Topology
- Reduce JCT: Compared with InfiniBand networks, job completion time (JCT) can be shortened by 3-5%.
- Reduce Tail Latency: Reduces end-to-end tail latency by up to 11% compared with traditional RoCE/IB switches.
- Congestion Free: INT-driven adaptive routing, packet spray, flowlet-based auto load balancing, and WCMP work together to effectively prevent congestion across the fabric.
- Improve GPU & Network Utilization: With the above technologies, the network achieves up to 97% utilization, directly enhancing GPU utilization during parallel computing workloads.
- 256K GPU Cluster with 400G per GPU: Implemented with a two-tier Clos architecture featuring eight rails, each consisting of 256 spine switches and 512 leaf switches.
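As a sanity check on the topology figures above, the 256K scale follows from the rail and switch counts. This is a back-of-the-envelope sketch: the per-leaf port figures (a 64×800G switch broken out to 128×400G, split 1:1 between GPU downlinks and spine uplinks) are assumptions, not specifications from the text.

```python
# Two-tier Clos sizing sketch. Only the rail/leaf/spine counts and the
# 256K GPU total come from the text; the port breakout is assumed.

RAILS = 8
LEAVES_PER_RAIL = 512
SPINES_PER_RAIL = 256

PORTS_400G_PER_LEAF = 128                        # assumed: 64 x 800G broken out to 400G
DOWNLINKS_PER_LEAF = PORTS_400G_PER_LEAF // 2    # assumed 1:1 split -> 64 GPUs per leaf

gpus_per_rail = LEAVES_PER_RAIL * DOWNLINKS_PER_LEAF
total_gpus = RAILS * gpus_per_rail

print(gpus_per_rail)   # 32768 GPUs per rail
print(total_gpus)      # 262144 = 256 * 1024, i.e. 256K GPUs at 400G each
```

Under these assumptions, 512 leaves × 64 downlinks × 8 rails gives exactly 262,144 GPUs, matching the advertised 256K cluster scale.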
Traffic
Supports a range of training communication patterns with In-band Network Telemetry (INT)-driven, topology-aware optimization, including Ring, Tree, Halving-and-Doubling, and Pipelined implementations of All-Reduce, All-Gather, and All-to-All, improving both GPU and network utilization.
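The Ring variant of All-Reduce mentioned above can be sketched in plain Python. This is a single-process simulation for illustration only; real collectives run in libraries such as NCCL over the fabric, and the chunking and ring indexing here are a simplified textbook formulation, not Asterfusion's implementation.

```python
def ring_all_reduce(buffers):
    """Simulate Ring All-Reduce: every rank ends with the elementwise sum.

    buffers: one list per rank, all the same length, length divisible by
    the rank count. Runs reduce-scatter, then all-gather, each in n-1 steps.
    """
    n = len(buffers)
    chunk = len(buffers[0]) // n

    def span(c):
        # Index range of logical chunk c within a rank's buffer.
        return slice(c * chunk, (c + 1) * chunk)

    # Reduce-scatter: at step t, rank r sends chunk (r - t) mod n to its
    # right neighbor, which accumulates it. After n-1 steps, rank r holds
    # the fully reduced chunk (r + 1) mod n.
    for step in range(n - 1):
        for r in range(n):
            dst = (r + 1) % n
            s = span((r - step) % n)
            buffers[dst][s] = [a + b for a, b in zip(buffers[dst][s], buffers[r][s])]

    # All-gather: circulate the reduced chunks around the ring so every
    # rank ends up with every chunk.
    for step in range(n - 1):
        for r in range(n):
            dst = (r + 1) % n
            s = span((r + 1 - step) % n)
            buffers[dst][s] = list(buffers[r][s])

    return buffers


# Example: 4 ranks, rank r starts with [r, r, ..., r]; sum is 0+1+2+3 = 6.
bufs = ring_all_reduce([[float(r)] * 8 for r in range(4)])
print(bufs[0])  # [6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0]
```

Each rank sends and receives only 2(n-1)/n of its buffer in total, which is why the ring pattern is bandwidth-optimal for large messages.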
Test Result
Llama 2 Large Model Training Test
The Llama2-7B model was tested with a sample sequence length of 2048, with parallel training runs across 16 GPU cards. The results show that with a single Asterfusion RoCE switch, the training time per step matches that of the IB switch, and both equal the time measured with the network cards directly connected. With two switches interconnected back-to-back, the training time through the Asterfusion RoCE switch is 3.38% lower than through the IB switch.

Three CX864E-N switches were used to form a spine-leaf network, with an IXIA AresONE-S 400G tester simulating a cluster of 16 GPUs. With the INT-Driven Adaptive Routing function enabled, the P95 tail-latency FCT value decreased by 11.13% compared with the ECMP five-tuple hash load balancing of traditional RoCE and IB switches. Taking the training of a typical LLaMA-39B model as an example, where communication accounts for about 55% of the time, job completion time (JCT) is reduced by about 6.12%.