
RoCE Beats InfiniBand: Asterfusion 800G Switch Crushes It with 20% Faster TGR

written by Asterfusion

May 19, 2025

To evaluate real-world AI inference performance, we deployed the DeepSeek large language model on a two-node H20 high-performance server cluster (eight GPUs per node), interconnected via both the Asterfusion ultra-low-latency AI switch CX864E-N and a traditional InfiniBand switch. Across a series of rigorous tests centered on key inference metrics, the CX864E-N demonstrated a clear performance edge: higher throughput (TGR) and significantly lower 90th-percentile inter-token latency (P90 ITL). The result: substantial gains in inference efficiency. Coupled with faster processing, lower network delay, and a cost-performance ratio far superior to InfiniBand, the CX864E-N empowers AI service providers to accelerate large-model deployments at scale while dramatically cutting costs.

Test Overview

This report presents the testing methodology and performance evaluation of the Asterfusion CX864E-N ultra-low-latency AI switch in a DeepSeek 671B model inference scenario. The evaluation covers end-to-end forwarding performance, NCCL communication performance, and network performance under DeepSeek inference workloads.


DeepSeek Inference Cluster Test Network

Network topology

Figure 1 shows the overall topology of the test network, and Figure 2 shows the internal topology of each server. Details follow:

Figure 1 Overall network topology of the test network

Each 800G switch port is fitted with an 800G OSFP optical module and fans out into two 400G connections over two MPO-12 cables, each terminating at a 400G OSFP transceiver on a NIC.

Figure 2 Server internal topology

Figure 1 illustrates the test network connectivity. Two servers, Server 1 (hostname: jytf-d1-311-h20-d05-1) and Server 2 (hostname: jytf-d1-311-h20-d05-2), serve as AI compute nodes. Each server is equipped with eight NVIDIA H20 GPUs and four 400G NVIDIA CX-7 NICs (labeled NIC0–NIC3), connected to the Asterfusion CX864E-N AI switch to form the compute network. The test primarily evaluates the performance of the CX864E-N switch.

Network traffic diagram

Figure 3 gives a simplified view of the traffic flows through the switch:

Figure 3 Network traffic diagram of the switch

Results Overview

Network forwarding and communication performance test

| Device | Test Contents | Test Item | Measured Result | Theoretical Limit | Remark |
|---|---|---|---|---|---|
| Asterfusion CX864E-N AI Switch | E2E forwarding test | Cross-switch forwarding bandwidth | 392.14 Gbps | 400 Gbps | Measured data; theoretical limit derived from the 400G port line rate |
| | | Cross-switch forwarding delay | 560 ns | / | Switch forwarding delay = total delay - NIC direct-connect delay = 2.51 µs - 1.95 µs = 0.56 µs = 560 ns |
| | NCCL test | Single-node NCCL test, node 1 | 469.21 GB/s | / | nccl-tests run on jytf-d1-311-h20-d05-1 |
| | | Single-node NCCL test, node 2 | 469.49 GB/s | / | nccl-tests run on jytf-d1-311-h20-d05-2 |
| | | Cross-switch dual-node all-reduce, NCCL_ALGO=ring | 195.75 GB/s | 200 GB/s | mpirun + nccl-tests; limit = 4 x 400G NICs per node = 200 GB/s |
| | | Cross-switch dual-node all-reduce, NCCL_ALGO=tree | 309.24 GB/s | / | mpirun + nccl-tests |
| | | Cross-switch dual-node all-reduce, NCCL_ALGO=collnet | 309.23 GB/s | / | mpirun + nccl-tests |
| | | Cross-switch dual-node all-to-all, NCCL_ALGO=ring | 29.78 GB/s | / | mpirun + nccl-tests |
| | | Cross-switch dual-node all-to-all, NCCL_ALGO=tree | 29.34 GB/s | / | mpirun + nccl-tests |
| | | Cross-switch dual-node all-to-all, NCCL_ALGO=collnet | 29.73 GB/s | / | mpirun + nccl-tests |
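The report does not include the exact benchmark invocations. As a reference, here is a minimal sketch of how such a dual-node nccl-tests sweep is typically driven, assuming nccl-tests and Open MPI are installed on both hosts with passwordless SSH; the message-size range and the NCCL_IB_HCA binding are illustrative assumptions, not values from the original test.

```python
import subprocess

# Two H20 nodes, eight GPUs each (hostnames from Figure 1)
HOSTS = "jytf-d1-311-h20-d05-1:8,jytf-d1-311-h20-d05-2:8"

def run_nccl(binary: str, algo: str) -> None:
    """Launch one dual-node, 16-rank nccl-tests sweep with a pinned NCCL algorithm."""
    cmd = [
        "mpirun", "-np", "16", "-H", HOSTS,
        "-x", f"NCCL_ALGO={algo}",            # pin ring / tree / collnet
        "-x", "NCCL_IB_HCA=mlx5",             # assumption: bind to the CX-7 RDMA NICs
        binary,
        "-b", "8", "-e", "8G", "-f", "2",     # message sizes from 8 B to 8 GiB, doubling
        "-g", "1",                            # one GPU per MPI rank
    ]
    subprocess.run(cmd, check=True)

for algo in ("ring", "tree", "collnet"):
    run_nccl("all_reduce_perf", algo)         # all-reduce rows of the table above
    run_nccl("alltoall_perf", algo)           # all-to-all rows of the table above
```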

DeepSeek inference scenario network performance test

| Network | Concurrency | TTFT (s) | P90 ITL (s) | TGR (tokens/s) | Char Gen Speed (chars/s) |
|---|---|---|---|---|---|
| RoCEv2 / 400G | 20 | 0.94 | 0.064 | 15.51 | 26.13 |
| IB / 400G | 20 | 0.76 | 0.078 | 14.77 | 27.34 |
| RoCEv2 / 400G | 30 | 0.96 | 0.071 | 14.91 | 25.20 |
| IB / 400G | 30 | 0.84 | 0.085 | 13.83 | 25.56 |
| RoCEv2 / 400G | 50 | 0.89 | 0.090 | 13.17 | 22.08 |
| IB / 400G | 50 | 0.83 | 0.113 | 11.86 | 21.89 |
| RoCEv2 / 400G | 70 | 0.84 | 0.106 | 11.97 | 20.13 |
| IB / 400G | 70 | 0.82 | 0.124 | 10.37 | 19.15 |
| RoCEv2 / 400G | 100 | 0.98 | 0.134 | 10.70 | 17.96 |
| IB / 400G | 100 | 0.95 | 0.152 | 8.39 | 15.52 |

Conclusion: Asterfusion's RoCEv2 network outperforms InfiniBand in throughput and inter-token stability at every concurrency level tested, with only a slight trade-off in first-token latency.

The results are plotted as a bar chart below to make the comparison more intuitive.

Understanding the Key Performance Metrics

Before diving into the test results, it’s important to first understand the core performance metrics evaluated in this study—TTFT, P90 ITL, TGR, and Character Generation Speed. These indicators are crucial in assessing the efficiency and responsiveness of large model inference systems.

1. TTFT (Time To First Token) – First Token Latency

Definition: The time taken from when a client sends a request to when the model generates and returns the first token. A lower TTFT means faster feedback and a smoother user experience.

2. P90 ITL (90th Percentile Inter-Token Latency)

Definition: This measures the time gap between generated tokens during inference. Specifically, the P90 value indicates that 90% of token intervals fall below this threshold. A lower P90 ITL reflects smoother and more stable output with less jitter in response latency.

3. TGR (Token Generation Rate) – Average Token Output Rate

Definition: Indicates how many tokens the model can generate per second (tokens/s). It reflects the overall throughput of the inference system. The higher the TGR, the more efficient and powerful the system is.

Business relevance: TGR is the most important and key productivity metric for AI inference providers. A higher TGR means:

  • More requests served per second;
  • Higher output efficiency;
  • Better resource utilization;
  • Lower operating costs.

4. Character Generation Speed

Definition: The average speed at which characters are generated in the final output, measured in characters per second (chars/s). Compared to token speed, this metric better reflects the actual user-perceived experience—how fast the text appears on screen.
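To make the four definitions concrete, here is a minimal sketch (not part of the original test harness) that computes all of them from one request's per-token arrival timestamps; the function name and sample data are hypothetical.

```python
import numpy as np

def inference_metrics(request_sent, token_times, text):
    """TTFT, P90 ITL, TGR, and character generation speed for a single request.

    request_sent: wall-clock time the request was sent (s)
    token_times:  arrival time of each generated token (s)
    text:         the fully decoded output string
    """
    ttft = token_times[0] - request_sent          # 1. Time To First Token
    gaps = np.diff(token_times)                   # intervals between consecutive tokens
    p90_itl = float(np.percentile(gaps, 90))      # 2. 90% of intervals fall below this
    total = token_times[-1] - request_sent
    tgr = len(token_times) / total                # 3. tokens generated per second
    char_speed = len(text) / total                # 4. visible characters per second
    return ttft, p90_itl, tgr, char_speed

# Hypothetical example: five tokens arriving after a 0.9 s first-token wait
print(inference_metrics(0.0, [0.9, 0.96, 1.03, 1.09, 1.16], "hello world"))
```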

📊 Key Performance Metrics for AI Inference – Comparison Table

| Metric | Full Name | Unit | Description | Represents | Most Relevant To | Higher/Lower is Better |
|---|---|---|---|---|---|---|
| TTFT | Time To First Token | ms | Time from when the request is sent to when the first token is received | Initial response time | End users | Lower |
| ITL | Inter-Token Latency | ms | Average or P90 latency between two consecutive tokens during generation | Smoothness and consistency | End users & system performance | Lower |
| TGR | Token Generation Rate | tokens/s | Number of tokens generated per second | Throughput and inference efficiency | AI service providers | Higher |
| Character Generation Speed | – | chars/s | Average speed of generating visible characters to users | Perceived output speed | End users | Higher |

Test Data Analysis: Why the Asterfusion RoCE Switch Stands Out

The test results clearly show that Asterfusion’s RoCE switch delivers lower P90 ITL and higher TGR compared to traditional InfiniBand (IB) switches. These advantages highlight several key strengths:
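To put numbers on that claim, the short script below recomputes the relative differences directly from the inference table above; it is a sanity check on the published figures, not code from the original test.

```python
# Per concurrency level: (RoCEv2 TGR, IB TGR, RoCEv2 P90 ITL, IB P90 ITL),
# values copied from the DeepSeek inference results table above
results = {
    20:  (15.51, 14.77, 0.064, 0.078),
    30:  (14.91, 13.83, 0.071, 0.085),
    50:  (13.17, 11.86, 0.090, 0.113),
    70:  (11.97, 10.37, 0.106, 0.124),
    100: (10.70,  8.39, 0.134, 0.152),
}

for c, (tgr_roce, tgr_ib, itl_roce, itl_ib) in results.items():
    tgr_gain = (tgr_roce / tgr_ib - 1) * 100      # RoCEv2 throughput advantage
    itl_cut = (1 - itl_roce / itl_ib) * 100       # RoCEv2 inter-token latency reduction
    print(f"concurrency {c:>3}: TGR +{tgr_gain:.1f}%, P90 ITL -{itl_cut:.1f}%")
```

The TGR advantage widens as load increases, from about 5% at 20 concurrent requests to roughly 27% at 100, while P90 ITL stays 12-20% lower throughout.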

1️⃣ Smoother Inference and Better User Experience

  • Lower P90 ITL indicates reduced latency between token generations, resulting in a more stable and fluid output process.
  • For users, this means more natural, uninterrupted AI responses. For the system, it translates into more efficient communication with less waiting between nodes.

2️⃣ Stronger System Throughput

  • Higher TGR (Token Generation Rate) means the model can produce more tokens per second, boosting overall inference throughput.
  • In the same amount of time, more inference tasks can be completed, or larger models can be served efficiently.

3️⃣ RoCE Architecture Is Better Suited for Inference Workloads

  • The results demonstrate that Asterfusion’s RoCE switch is better optimized for AI inference traffic in multi-GPU, multi-node deployments.
  • Compared to traditional InfiniBand, RoCE offers greater flexibility and superior cost-performance.

4️⃣ A More Efficient Network = Greater Business Value

  • Low latency + high throughput means shorter response times and greater concurrency—enabling more to be done with the same hardware.
  • For AI service providers, this directly translates to lower costs, higher revenue, and better scalability.

Among all performance metrics, TGR (Token Generation Rate) is especially crucial. It directly reflects the system’s output efficiency and business value. The higher the TGR, the more tokens an AI provider can generate—and monetize—per second. Here’s why that matters:

Why Is TGR a Core Indicator of AI Profitability?

1️⃣ Higher Request Concurrency

  • A higher TGR allows more tokens to be generated per second, meaning more user requests can be processed simultaneously—boosting service capacity.

2️⃣ Shorter Task Times, Better Resource Utilization

  • With a high TGR, each inference task completes faster, freeing up server resources more quickly.
  • This allows the same hardware to handle more users or tasks, significantly improving ROI (Return on Investment).

3️⃣ Lower Compute Costs, Higher Profit Margins

  • Systems with high TGR can achieve the same scale of inference with fewer servers, reducing power, rental, and maintenance costs.
  • With better efficiency and lower overhead, profitability increases naturally.

4️⃣ Better User Experience and Retention

  • Faster output means more responsive interaction, which enhances user satisfaction.
  • A smoother experience encourages customer loyalty and subscription renewals.

Asterfusion 800G AI RoCE Switch: A Breakthrough in Both Performance and Cost

  • Unmatched Performance: With higher TGR and lower P90 ITL, it supercharges inference speed, responsiveness, and system throughput.
  • Built for Scalable AI: Faster responses, lower latency, and industry-leading cost-efficiency empower AI providers to scale faster, serve more users, and do more with less.

Bottom Line: The Asterfusion RoCE switch delivers exceptional performance at a fraction of the cost—boosting inference productivity, maximizing ROI, and giving AI service providers a clear edge in the race to commercial success.

For more on the comparison between InfiniBand and RoCE:

RoCE or InfiniBand? The Most Comprehensive Technical Comparison (I)

RoCE or InfiniBand? The Most Comprehensive Technical Comparison (II)
