
AI Training Network


Accelerate AI Training Beyond InfiniBand
  

Built on Ultra Ethernet, the CX864E switch delivers ultra-large scale (256K GPUs), ultra-low latency (500ns), and ultra-high network utilization (97%), shortening job completion time (JCT) by 3-5% compared with InfiniBand (IB) networks.

  • 256K GPU: Ultra Large Scale
  • 500ns: Extremely Low Latency
  • 97%: Network Utilization
  • 3-5% Lower JCT than IB

The Value of the AI Training Network

AI Training Network Topology

  • Reduce JCT 
    Compared with InfiniBand networks, job completion time (JCT) can be shortened by 3-5%.
  • Reduce Tail Latency
    Reduces end-to-end tail latency by up to 11% compared with traditional RoCE/IB switches.
  • Congestion Free
    INT-driven adaptive routing, packet spray, flowlet-based auto load balancing, and WCMP work together to prevent congestion across the fabric (a simplified flowlet/WCMP sketch follows this list).
  • Improve GPU & Network Utilization
    With the above technologies, the network achieves up to 97% utilization, directly enhancing GPU utilization during parallel computing workloads.
  • 256K GPU cluster with 400G per GPU
    Implemented with a two-tier Clos architecture featuring eight rails, each consisting of 256 spine switches and 512 leaf switches (a scale check follows this list).
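
The sketch below illustrates, in principle, how flowlet-based load balancing and WCMP can interact. It is a simplified, generic model, not Asterfusion's implementation: the gap threshold, path names, and weight values are illustrative assumptions.

```python
import random

# Generic sketch of flowlet-based load balancing with weighted path selection
# (WCMP). The gap threshold, path names, and weights are assumptions.

FLOWLET_GAP_US = 100  # assumed idle gap that starts a new flowlet

class FlowletLoadBalancer:
    def __init__(self, paths, weights):
        self.paths = paths      # candidate uplinks toward the spine layer
        self.weights = weights  # WCMP weights, e.g. refreshed from INT congestion data
        self.state = {}         # flow 5-tuple -> (chosen path, last packet time in us)

    def select_path(self, five_tuple, now_us):
        path, last_seen = self.state.get(five_tuple, (None, None))
        # Re-pick a path only when the flow is new or has been idle long enough,
        # so packets inside a flowlet stay on one path and arrive in order.
        if path is None or now_us - last_seen > FLOWLET_GAP_US:
            path = random.choices(self.paths, weights=self.weights, k=1)[0]
        self.state[five_tuple] = (path, now_us)
        return path

# Example: four uplinks with unequal WCMP weights.
lb = FlowletLoadBalancer(paths=["spine0", "spine1", "spine2", "spine3"],
                         weights=[3, 3, 2, 2])
flow = ("10.0.0.1", "10.0.1.1", 4791, 4791, "UDP")
print(lb.select_path(flow, now_us=0))      # picks a path for the new flowlet
print(lb.select_path(flow, now_us=50))     # same flowlet, stays on the same path
print(lb.select_path(flow, now_us=500))    # idle gap exceeded, may switch paths
```

A quick scale check of the two-tier Clos described above. The port allocation is an assumption for illustration: each CX864E is treated as 128x400G ports (64x800G broken out), with leaf ports split evenly between GPU-facing downlinks and spine-facing uplinks.

```python
# Back-of-the-envelope scale check for the eight-rail, two-tier Clos fabric.
RAILS = 8
SPINES_PER_RAIL = 256
LEAVES_PER_RAIL = 512
PORTS_400G_PER_SWITCH = 128                   # assumed 64x800G broken out to 128x400G
LEAF_DOWNLINKS = PORTS_400G_PER_SWITCH // 2   # 64 GPUs at 400G per leaf
LEAF_UPLINKS = PORTS_400G_PER_SWITCH // 2     # 64 uplinks per leaf

gpus_per_rail = LEAVES_PER_RAIL * LEAF_DOWNLINKS   # 512 * 64 = 32,768
total_gpus = RAILS * gpus_per_rail                 # 8 * 32,768 = 262,144 (~256K)

# Non-blocking check within a rail: leaf uplinks must match spine port capacity.
assert LEAVES_PER_RAIL * LEAF_UPLINKS == SPINES_PER_RAIL * PORTS_400G_PER_SWITCH

print(f"GPUs per rail: {gpus_per_rail:,}, total GPUs: {total_gpus:,}")
```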

Traffic

Enables various training communication patterns with In-band Network Telemetry (INT)-driven, topology-aware optimization, including Ring, Tree, Halving and Doubling, and Pipelined implementations of All-Reduce, All-Gather, and All-to-All, improving GPU and network utilization.
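
As a reference for one of these patterns, the following is a minimal single-process simulation of the Ring All-Reduce schedule (the generic algorithm, not the switch's internal logic): a reduce-scatter phase followed by an all-gather phase, with each rank moving N-1 chunks per phase so every link in the ring carries an equal share of traffic.

```python
import numpy as np

# Minimal single-process simulation of the Ring All-Reduce communication
# pattern: reduce-scatter, then all-gather, over n simulated ranks.

def ring_all_reduce(tensors):
    n = len(tensors)
    # Each rank splits its tensor into n chunks.
    chunks = [list(np.array_split(np.asarray(t, dtype=float), n)) for t in tensors]

    # Reduce-scatter: after n-1 steps, rank r holds the full sum of chunk (r+1) % n.
    for step in range(n - 1):
        for r in range(n):
            idx = (r - step) % n
            chunks[(r + 1) % n][idx] += chunks[r][idx]

    # All-gather: circulate the fully reduced chunks until every rank has all of them.
    for step in range(n - 1):
        for r in range(n):
            idx = (r + 1 - step) % n
            chunks[(r + 1) % n][idx] = chunks[r][idx].copy()

    return [np.concatenate(c) for c in chunks]

# Example: 4 simulated ranks, each contributing an 8-element gradient vector.
ranks = [np.arange(8.0) * (r + 1) for r in range(4)]
result = ring_all_reduce(ranks)
assert all(np.allclose(out, sum(ranks)) for out in result)
```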

Test Results

Llama 2 Large Model Training Test

Large Model Single-step Training Time
  • NIC Direct Connection: 878ms
  • Via Single Asterfusion Switch (RoCE): 878ms
  • Via Two Asterfusion Switches (RoCE): 885ms
  • Via Single Mellanox Switch (IB): 878ms
  • Via Two Mellanox Switches (IB): 916ms

The Llama2-7B model was tested with a sample sequence length of 2048, running parallel training across 16 GPU cards. The results show that with a single Asterfusion RoCE switch, the per-step training time matches that of the IB switch, and both equal the NIC direct-connection baseline (878ms). With two switches interconnected back-to-back, the Asterfusion RoCE setup completes each step 3.38% faster than the IB setup (885ms vs. 916ms).

AI Training Network: All-Reduce Bidirectional Ring P95 FCT Comparison

Three CX864E-N switches were used to form a spine-leaf network, with an IXIA AresONE-S 400G tester simulating a cluster of 16 GPUs. With the INT-Driven Adaptive Routing function enabled, the P95 tail-latency FCT decreased by 11.13% compared with the ECMP five-tuple hash load balancing used by traditional RoCE and IB switches. Taking the training of a typical LLaMA-39B model, where communication accounts for about 55% of the training time, as an example, the job completion time (JCT) is reduced by about 6.12% (a simple estimate of this figure is sketched below).
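
The 6.12% figure follows from a simple proportionality estimate, assuming communication time scales with the P95 FCT and compute time is unchanged:

```python
# Rough estimate of the JCT gain from the measured tail-latency improvement,
# assuming communication time scales with P95 FCT and compute time is unchanged.
comm_fraction = 0.55      # LLaMA-39B: communication is ~55% of training time
fct_reduction = 0.1113    # measured P95 FCT reduction with adaptive routing

jct_reduction = comm_fraction * fct_reduction
print(f"Estimated JCT reduction: {jct_reduction:.2%}")  # ~6.12%
```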