
AI Training Network


Accelerate AI Training Beyond InfiniBand
  

Built on Ultra Ethernet, the CX864E switch delivers ultra-large scale (256K GPUs), ultra-low latency (500ns), and ultra-high network utilization (97%), shortening job completion time (JCT) by 3-5% compared with InfiniBand (IB) networks.

  • 256K GPU: Ultra Large Scale
  • 500ns: Extremely Low Latency
  • 97%: Network Utilization
  • 3-5% Lower JCT than IB

The Value of the AI Training Network

AI Training Network Topology

  • Reduce JCT 
    Compared with InfiniBand networks, job completion time (JCT) can be shortened by 3-5%.
  • Reduce Tail Latency
    Reduces end-to-end tail latency by up to 11% compared with traditional RoCE/IB switches.
  • Congestion Free
    INT-driven adaptive routing, packet spray, flowlet-based auto load balancing, and WCMP work together to prevent congestion across the fabric (a simplified flowlet/WCMP sketch follows this list).
  • Improve GPU & Network Utilization
    With the above technologies, the network achieves up to 97% utilization, directly enhancing GPU utilization during parallel computing workloads.
  • 256K GPU cluster with 400G per GPU
    Implemented with a two-tier Clos architecture featuring eight rails, each consisting of 256 spine switches and 512 leaf switches (a scale check follows this list).
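
The sketch below illustrates, in principle, how flowlet-based load balancing and WCMP can interact. It is a simplified, generic model, not Asterfusion's implementation: the gap threshold, path names, and weight values are illustrative assumptions.

```python
import random

# Generic sketch of flowlet-based load balancing with weighted path selection
# (WCMP). The gap threshold, path names, and weights are assumptions.

FLOWLET_GAP_US = 100  # assumed idle gap that starts a new flowlet

class FlowletLoadBalancer:
    def __init__(self, paths, weights):
        self.paths = paths      # candidate uplinks toward the spine layer
        self.weights = weights  # WCMP weights, e.g. refreshed from INT congestion data
        self.state = {}         # flow 5-tuple -> (chosen path, last packet time in us)

    def select_path(self, five_tuple, now_us):
        path, last_seen = self.state.get(five_tuple, (None, None))
        # Re-pick a path only when the flow is new or has been idle long enough,
        # so packets inside a flowlet stay on one path and arrive in order.
        if path is None or now_us - last_seen > FLOWLET_GAP_US:
            path = random.choices(self.paths, weights=self.weights, k=1)[0]
        self.state[five_tuple] = (path, now_us)
        return path

# Example: four uplinks with unequal WCMP weights.
lb = FlowletLoadBalancer(paths=["spine0", "spine1", "spine2", "spine3"],
                         weights=[3, 3, 2, 2])
flow = ("10.0.0.1", "10.0.1.1", 4791, 4791, "UDP")
print(lb.select_path(flow, now_us=0))      # picks a path for the new flowlet
print(lb.select_path(flow, now_us=50))     # same flowlet, stays on the same path
print(lb.select_path(flow, now_us=500))    # idle gap exceeded, may switch paths
```

A quick scale check of the two-tier Clos described above. The port allocation is an assumption for illustration: each CX864E is treated as 128x400G ports (64x800G broken out), with leaf ports split evenly between GPU-facing downlinks and spine-facing uplinks.

```python
# Back-of-the-envelope scale check for the eight-rail, two-tier Clos fabric.
RAILS = 8
SPINES_PER_RAIL = 256
LEAVES_PER_RAIL = 512
PORTS_400G_PER_SWITCH = 128                   # assumed 64x800G broken out to 128x400G
LEAF_DOWNLINKS = PORTS_400G_PER_SWITCH // 2   # 64 GPUs at 400G per leaf
LEAF_UPLINKS = PORTS_400G_PER_SWITCH // 2     # 64 uplinks per leaf

gpus_per_rail = LEAVES_PER_RAIL * LEAF_DOWNLINKS   # 512 * 64 = 32,768
total_gpus = RAILS * gpus_per_rail                 # 8 * 32,768 = 262,144 (~256K)

# Non-blocking check within a rail: leaf uplinks must match spine port capacity.
assert LEAVES_PER_RAIL * LEAF_UPLINKS == SPINES_PER_RAIL * PORTS_400G_PER_SWITCH

print(f"GPUs per rail: {gpus_per_rail:,}, total GPUs: {total_gpus:,}")
```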

Traffic

Enables various training communication patterns with In-band Network Telemetry (INT)-driven, topology-aware optimization, including Ring, Tree, Halving and Doubling, and Pipelined implementations of All-Reduce, All-Gather, and All-to-All, improving GPU and network utilization.
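
As a reference for one of these patterns, the following is a minimal single-process simulation of the Ring All-Reduce schedule (the generic algorithm, not the switch's internal logic): a reduce-scatter phase followed by an all-gather phase, with each rank moving N-1 chunks per phase so every link in the ring carries an equal share of traffic.

```python
import numpy as np

# Minimal single-process simulation of the Ring All-Reduce communication
# pattern: reduce-scatter, then all-gather, over n simulated ranks.

def ring_all_reduce(tensors):
    n = len(tensors)
    # Each rank splits its tensor into n chunks.
    chunks = [list(np.array_split(np.asarray(t, dtype=float), n)) for t in tensors]

    # Reduce-scatter: after n-1 steps, rank r holds the full sum of chunk (r+1) % n.
    for step in range(n - 1):
        for r in range(n):
            idx = (r - step) % n
            chunks[(r + 1) % n][idx] += chunks[r][idx]

    # All-gather: circulate the fully reduced chunks until every rank has all of them.
    for step in range(n - 1):
        for r in range(n):
            idx = (r + 1 - step) % n
            chunks[(r + 1) % n][idx] = chunks[r][idx].copy()

    return [np.concatenate(c) for c in chunks]

# Example: 4 simulated ranks, each contributing an 8-element gradient vector.
ranks = [np.arange(8.0) * (r + 1) for r in range(4)]
result = ring_all_reduce(ranks)
assert all(np.allclose(out, sum(ranks)) for out in result)
```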

Test Results

Llama 2 Large Model Training Test

Large Model Single-step Training Time
  • NIC Direct Connection: 878ms
  • Via Single Asterfusion Switch (RoCE): 878ms
  • Via Two Asterfusion Switches (RoCE): 885ms
  • Via Single Mellanox Switch (IB): 878ms
  • Via Two Mellanox Switches (IB): 916ms

The Llama2-7B model was tested with a sample sequence length of 2048, running parallel training across 16 GPU cards. The results show that with a single Asterfusion RoCE switch, the per-step training time matches that of the IB switch, and both equal the NIC direct-connection baseline (878ms). With two switches interconnected back-to-back, the Asterfusion RoCE setup completes each step 3.38% faster than the IB setup (885ms vs. 916ms).

AI Training Network: All-Reduce Bidirectional Ring P95 FCT Comparison

Three CX864E-N switches were used to form a spine-leaf network, with an IXIA AresONE-S 400G tester simulating a cluster of 16 GPUs. With the INT-Driven Adaptive Routing function enabled, the P95 tail-latency FCT decreased by 11.13% compared with the ECMP five-tuple hash load balancing used by traditional RoCE and IB switches. Taking the training of a typical LLaMA-39B model, where communication accounts for about 55% of the training time, as an example, the job completion time (JCT) is reduced by about 6.12% (a simple estimate of this figure is sketched below).
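
The 6.12% figure follows from a simple proportionality estimate, assuming communication time scales with the P95 FCT and compute time is unchanged:

```python
# Rough estimate of the JCT gain from the measured tail-latency improvement,
# assuming communication time scales with P95 FCT and compute time is unchanged.
comm_fraction = 0.55      # LLaMA-39B: communication is ~55% of training time
fct_reduction = 0.1113    # measured P95 FCT reduction with adaptive routing

jct_reduction = comm_fraction * fct_reduction
print(f"Estimated JCT reduction: {jct_reduction:.2%}")  # ~6.12%
```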