AI Inference Network
An all-to-all, ultra-low-latency, congestion-free network optimized for MoE-architecture reasoning models (such as DeepSeek), raising the Token Generation Rate, and the revenue it drives, by over 27.5%.
[Infographic: all-to-all, ultra-low-latency, congestion-free network; Token Generation Rate ↑ 27.5%]
The Value of the AI Inference Network

AI Inference Network Topology
- Improve Token Generation Rate: Token Generation Rate (TGR) increases steadily with the number of concurrent users; at 100 users, TGR is 27.5% higher than with InfiniBand.
- Reduce Inference Latency: by minimizing leaf-to-leaf forwarding latency, the average inference time per token is reduced by 20.4% compared with InfiniBand.
- Congestion Free: INT-driven adaptive routing, packet spray, flowlet-based auto load balancing, and WCMP work together to prevent congestion across the fabric (see the flowlet/WCMP sketch after this list).
- Improve GPU & Network Utilization: with these technologies, the network reaches up to 97% utilization, directly improving GPU utilization during parallel computing workloads.
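The sketch below illustrates the flowlet-based, weighted path selection (WCMP-style) idea referenced in the Congestion Free bullet. It is a minimal illustration under assumptions, not Asterfusion's implementation: the flowlet gap threshold, path names, and weights are invented for the example, and in a real fabric the weights would come from INT link telemetry.

```python
import random
import time

# Assumed value: an idle gap of 200 us ends the current flowlet.
FLOWLET_GAP_S = 200e-6

class FlowletBalancer:
    """Pin each flowlet to one path; spread new flowlets across paths by weight (WCMP)."""

    def __init__(self, paths):
        # paths: {path_id: weight}; weights could be derived from INT telemetry.
        self.paths = paths
        self.state = {}  # flow_key -> (last_packet_time, chosen_path)

    def pick_path(self, flow_key, now=None):
        now = time.monotonic() if now is None else now
        last = self.state.get(flow_key)
        if last is not None and (now - last[0]) < FLOWLET_GAP_S:
            # Same flowlet: keep the same path so packets are not reordered.
            path = last[1]
        else:
            # New flowlet: weighted random choice across the available paths.
            ids, weights = zip(*self.paths.items())
            path = random.choices(ids, weights=weights, k=1)[0]
        self.state[flow_key] = (now, path)
        return path

# Example: three spine uplinks with unequal weights (hypothetical names).
balancer = FlowletBalancer({"spine-1": 3, "spine-2": 3, "spine-3": 2})
print(balancer.pick_path(("10.0.0.1", "10.0.1.2", 4791)))  # 4791 = RoCEv2 UDP port
```

Because only flowlet boundaries can change the path, load spreads across links without the packet reordering that per-packet spraying alone would cause.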
Traffic
During MoE model inference, traffic types such as All-to-All, Prefill-Decode Disaggregation, and Pipeline Parallelism are distinguished and routed by In-band Network Telemetry (INT)-driven, topology-aware optimization. Differentiated QoS priorities and ECMP group selection balance high throughput against low latency, ultimately raising network utilization.
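As a rough illustration of that classification, the sketch below maps the three traffic types to QoS markings and ECMP groups. The DSCP values, queue numbers, and group names are assumptions made for the example, not the switch's actual policy.

```python
from dataclasses import dataclass

@dataclass
class TrafficClass:
    name: str
    dscp: int        # DSCP marking applied at the ingress leaf (assumed values)
    queue: int       # egress priority queue, higher = more urgent (assumed)
    ecmp_group: str  # ECMP/WCMP group chosen for this class (hypothetical names)

# Latency-sensitive All-to-All expert exchange gets the highest priority;
# bulk Prefill-Decode transfers are steered to a throughput-oriented group.
POLICY = {
    "all_to_all":        TrafficClass("All-to-All expert dispatch",    46, 7, "low-latency"),
    "prefill_decode":    TrafficClass("Prefill-Decode KV transfer",    26, 5, "high-throughput"),
    "pipeline_parallel": TrafficClass("Pipeline-parallel activations", 34, 6, "low-latency"),
}
BEST_EFFORT = TrafficClass("best-effort", 0, 1, "default")

def classify(traffic_hint):
    # traffic_hint: an application-supplied tag (e.g. carried per RDMA queue pair).
    return POLICY.get(traffic_hint, BEST_EFFORT)

print(classify("all_to_all"))
```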
Test Results
Two servers serve as AI computing nodes running inference on the DeepSeek-R1 671B model. Each server is equipped with 8 x NVIDIA H20 GPUs and 4 x 400G CX-7 NICs and connects to an Asterfusion CX864E-N AI switch to form the computing network. The test records metrics such as per-inference latency and TGR, and a comparative test is run under identical conditions with an IB switch (QM9700).
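For reference, a minimal sketch of how the two reported metrics can be derived from a run log is shown below; the record format and field names are assumed, not the actual test harness.

```python
import statistics

def summarize(records, wall_time_s):
    # records: one entry per concurrent request, assumed shape:
    #   {"tokens_generated": 512, "latency_s": 8.7}
    total_tokens = sum(r["tokens_generated"] for r in records)
    tgr = total_tokens / wall_time_s                       # Token Generation Rate
    p90 = statistics.quantiles([r["latency_s"] for r in records], n=10)[-1]
    return {"tgr_tokens_per_s": tgr, "p90_latency_s": p90}

print(summarize(
    [{"tokens_generated": 256, "latency_s": 7.9},
     {"tokens_generated": 310, "latency_s": 8.4},
     {"tokens_generated": 288, "latency_s": 9.1}],
    wall_time_s=9.1,
))
```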

Across concurrency levels from 20 to 100 inference requests, latency with Asterfusion’s RoCE switch is consistently lower than with the InfiniBand (IB) switch; at 50 concurrent requests, the 90th-percentile inference latency is 20.4% lower.
Over the same range, Asterfusion’s RoCE switch also delivers a consistently higher Token Generation Rate (TGR) than the IB switch. The advantage widens as concurrency increases, reaching a 27.5% TGR improvement at 100 requests.
