RoCE Beats InfiniBand: Asterfusion 800G Switch Crushes It with 20% Faster TGR
written by Asterfusion
To evaluate real-world AI inference performance, we deployed the DeepSeek large language model on an H20 high-performance server cluster (two nodes, eight GPUs per node) interconnected via both the Asterfusion ultra-low-latency AI switch CX864E-N and a traditional InfiniBand switch. Through a series of rigorous tests centered on key inference metrics, the CX864E-N demonstrated a clear performance edge: higher token generation rate (TGR) and significantly lower 90th-percentile inter-token latency (P90 ITL). The result? Substantial gains in inference efficiency. Coupled with faster processing, lower network delay, and a cost-performance ratio far superior to InfiniBand, the CX864E-N empowers AI service providers to accelerate large-model deployments at scale while dramatically cutting costs.
Test Overview
This report presents the testing methodology and performance evaluation of the Asterfusion CX864E-N ultra-low-latency AI switch in a DeepSeek 671B model inference scenario. The evaluation focuses on two key areas: network forwarding and communication performance, and network performance under DeepSeek inference workloads.

DeepSeek Inference Cluster Test Network
Network topology
Figure 1 shows the overall network topology of the test network, and Figure 2 shows the internal topology of the server. Details are as follows:

Each 800G switch port, equipped with an 800G OSFP optical module, is broken out into two 400G links over two MPO-12 cables, each terminating at a 400G OSFP transceiver on one of the two connected 400G NICs.

Figure 1 illustrates the test network connectivity. Two servers, Server 1 (hostname: jytf-d1-311-h20-d05-1) and Server 2 (hostname: jytf-d1-311-h20-d05-2), serve as AI compute nodes. Each server is equipped with eight NVIDIA H20 GPUs and four 400G NVIDIA CX-7 NICs (labeled NIC0–NIC3), connected to the Asterfusion CX864E-N AI switch to form the compute network. The test primarily evaluates the performance of the CX864E-N switch.
Network traffic diagram
Figure 3 shows a simplified diagram of traffic flow through the switch:


Conclusion Overview
Network forwarding and communication performance test
| Device | Test Contents | Test Items | Measured Result | Theoretical Limit | Remark |
|---|---|---|---|---|---|
| Asterfusion CX864E-N AI Switch | E2E forwarding test | Cross-switch forwarding bandwidth | 392.14 Gbps | 400 Gbps | Measured data; the theoretical limit is the 400G port line rate |
| | | Cross-switch forwarding delay | 560 ns | / | Switch forwarding delay = total delay - NIC direct-connection delay = 2.51 µs - 1.95 µs = 0.56 µs = 560 ns |
| | NCCL test | Single-node NCCL test 1 | 469.21 GB/s | / | nccl-tests measured on jytf-d1-311-h20-d05-1 |
| | | Single-node NCCL test 2 | 469.49 GB/s | / | nccl-tests measured on jytf-d1-311-h20-d05-2 |
| | | Cross-switch dual-node all-reduce (ring) | 195.75 GB/s | 200 GB/s | Measured with mpirun + nccl-tests, NCCL_ALGO=ring |
| | | Cross-switch dual-node all-reduce (tree) | 309.24 GB/s | / | Measured with mpirun + nccl-tests, NCCL_ALGO=tree |
| | | Cross-switch dual-node all-reduce (collnet) | 309.23 GB/s | / | Measured with mpirun + nccl-tests, NCCL_ALGO=collnet |
| | | Cross-switch dual-node all-to-all (ring) | 29.78 GB/s | / | Measured with mpirun + nccl-tests, NCCL_ALGO=ring |
| | | Cross-switch dual-node all-to-all (tree) | 29.34 GB/s | / | Measured with mpirun + nccl-tests, NCCL_ALGO=tree |
| | | Cross-switch dual-node all-to-all (collnet) | 29.73 GB/s | / | Measured with mpirun + nccl-tests, NCCL_ALGO=collnet |
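For context, the sketch below shows in Python how the delay figure in the remark column is derived, and roughly how a dual-node nccl-tests run of the kind listed above can be launched. This is a minimal illustration, not the exact harness used in the test: the process counts, binary path, and the NCCL_IB_HCA value are assumptions (the host names come from the topology description above).

```python
import subprocess

def switch_forwarding_delay_ns(total_delay_us, nic_direct_delay_us):
    """Per the remark column: switch forwarding delay is the end-to-end
    delay minus the NIC back-to-back (direct-connection) baseline."""
    return (total_delay_us - nic_direct_delay_us) * 1000

# Values from the table: 2.51 us end-to-end, 1.95 us NIC-to-NIC baseline.
print(f"{switch_forwarding_delay_ns(2.51, 1.95):.0f} ns")  # -> 560 ns

# Hypothetical dual-node all-reduce run pinning the NCCL algorithm, as in
# the table. 16 ranks = 2 nodes x 8 GPUs, one GPU per rank (-g 1).
subprocess.run([
    "mpirun", "-np", "16",
    "-H", "jytf-d1-311-h20-d05-1:8,jytf-d1-311-h20-d05-2:8",
    "-x", "NCCL_ALGO=ring",    # also run with tree and collnet
    "-x", "NCCL_IB_HCA=mlx5",  # assumed RDMA NIC selection
    "./build/all_reduce_perf", "-b", "8", "-e", "8G", "-f", "2", "-g", "1",
], check=True)
```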
DeepSeek inference scenario network performance test
| Network / Link Rate | Inference Request Concurrency | TTFT (s) | P90 ITL (s) | TGR (tokens/s) | Character Generation Speed (chars/s) |
|---|---|---|---|---|---|
| RoCEv2 Network / 400G | 20 | 0.94 | 0.064 | 15.51 | 26.13 |
| IB Network / 400G | 20 | 0.76 | 0.078 | 14.77 | 27.34 |
| RoCEv2 Network / 400G | 30 | 0.96 | 0.071 | 14.91 | 25.2 |
| IB Network / 400G | 30 | 0.84 | 0.085 | 13.83 | 25.56 |
| RoCEv2 Network / 400G | 50 | 0.89 | 0.09 | 13.17 | 22.08 |
| IB Network / 400G | 50 | 0.83 | 0.113 | 11.86 | 21.89 |
| RoCEv2 Network / 400G | 70 | 0.84 | 0.106 | 11.97 | 20.13 |
| IB Network / 400G | 70 | 0.82 | 0.124 | 10.37 | 19.15 |
| RoCEv2 Network / 400G | 100 | 0.98 | 0.134 | 10.7 | 17.96 |
| IB Network / 400G | 100 | 0.95 | 0.152 | 8.39 | 15.52 |

Conclusion: Asterfusion's RoCEv2 network outperforms InfiniBand in concurrency and stability, with only a slight trade-off in first-token response speed.
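As a quick sanity check, the minimal Python sketch below recomputes the relative TGR advantage of the RoCEv2 fabric at each concurrency level directly from the measured values in the table above; it is not part of the original test harness.

```python
# Relative TGR gain of the Asterfusion RoCEv2 network over InfiniBand,
# using the measured tokens/s values from the table above.
tgr = {  # concurrency: (RoCEv2 tokens/s, IB tokens/s)
    20: (15.51, 14.77),
    30: (14.91, 13.83),
    50: (13.17, 11.86),
    70: (11.97, 10.37),
    100: (10.70, 8.39),
}
for concurrency, (roce, ib) in tgr.items():
    gain_pct = (roce - ib) / ib * 100
    print(f"concurrency {concurrency:>3}: RoCEv2 TGR +{gain_pct:.1f}% vs IB")
# The gap widens with load, from about +5.0% at 20 concurrent requests
# to about +27.5% at 100.
```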
The results are plotted as bar charts below to make the comparison more intuitive:




Understanding the Key Performance Metrics
Before diving into the test results, it’s important to first understand the core performance metrics evaluated in this study—TTFT, P90 ITL, TGR, and Character Generation Speed. These indicators are crucial in assessing the efficiency and responsiveness of large model inference systems.
1. TTFT (Time To First Token) – First Token Latency
Definition: The time taken from when a client sends a request to when the model generates and returns the first token. A lower TTFT means faster feedback and a smoother user experience.
Plain explanation: Think of TTFT as the time between clicking “Send” and seeing the AI’s first word appear. The shorter this delay, the quicker the system feels to the user.
2. P90 ITL (90th Percentile Inter-Token Latency)
Definition: This measures the time gap between generated tokens during inference. Specifically, the P90 value indicates that 90% of token intervals fall below this threshold. A lower P90 ITL reflects smoother and more stable output with less jitter in response latency.
Plain explanation: When AI responds, it doesn’t generate the entire answer instantly—it produces it word by word (or token by token). The pause between each word is the inter-token latency. P90 ITL means that in 90% of cases, this pause doesn’t exceed a certain time. The smaller this value, the more fluid and natural the AI’s output feels—less lag, more continuity.
3. TGR (Token Generation Rate) – Average Token Output Rate
Definition: Indicates how many tokens the model can generate per second (tokens/s). It reflects the overall throughput of the inference system. The higher the TGR, the more efficient and powerful the system is.
Business relevance: TGR is the key productivity metric for AI inference providers. A higher TGR means:
- More requests served per second;
- Higher output efficiency;
- Better resource utilization;
- Lower operating costs.
In short: Improving TGR = Enhancing inference efficiency = Increasing revenue potential.
4. Character Generation Speed
Definition: The average speed at which characters are generated in the final output, measured in characters per second (chars/s). Compared to token speed, this metric better reflects the actual user-perceived experience—how fast the text appears on screen.
Plain explanation: This is essentially how quickly words “pop out” on the screen. The faster characters appear, the more responsive and smooth the experience feels to users.
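To make the four definitions concrete, here is a minimal Python sketch of how they can be computed from a single request's client-side timeline. The function name and inputs (`request_sent`, `token_times`, `output_text`) are hypothetical, and P90 is taken with a simple nearest-rank approximation; real benchmark harnesses may differ in the details.

```python
def inference_metrics(request_sent, token_times, output_text):
    """Compute TTFT, P90 ITL, TGR, and character generation speed for one
    request. `request_sent` is the send timestamp (s), `token_times` are
    per-token arrival timestamps (s), `output_text` is the decoded reply."""
    ttft = token_times[0] - request_sent                  # TTFT (s)
    itls = sorted(b - a for a, b in zip(token_times, token_times[1:]))
    p90_itl = itls[int(0.9 * (len(itls) - 1))]            # P90 ITL (s), nearest-rank
    duration = token_times[-1] - request_sent
    tgr = len(token_times) / duration                     # TGR (tokens/s)
    char_speed = len(output_text) / duration              # chars/s
    return ttft, p90_itl, tgr, char_speed
```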
📊 Key Performance Metrics for AI Inference – Comparison Table
| Metric | Full Name | Unit | Description | Represents | Most Relevant To | Higher/Lower Is Better |
|---|---|---|---|---|---|---|
| TTFT | Time To First Token | ms | Time from when the request is sent to when the first token is received | Initial response time | End users | Lower |
| ITL | Inter-Token Latency | ms | Average or P90 latency between two consecutive tokens during generation | Smoothness and consistency | End users & system performance | Lower |
| TGR | Token Generation Rate | tokens/s | Number of tokens generated per second | Throughput and inference efficiency | AI service providers | Higher |
| Character Generation Speed | - | chars/s | Average speed at which visible characters are delivered to users | Perceived output speed | End users | Higher |
Test Data Analysis: Why the Asterfusion RoCE Switch Stands Out

The test results clearly show that Asterfusion’s RoCE switch delivers lower P90 ITL and higher TGR compared to traditional InfiniBand (IB) switches. These advantages highlight several key strengths:
1️⃣ Smoother Inference and Better User Experience
- Lower P90 ITL indicates reduced latency between token generations, resulting in a more stable and fluid output process.
- For users, this means more natural, uninterrupted AI responses. For the system, it translates into more efficient communication with less waiting between nodes.
2️⃣ Stronger System Throughput
- Higher TGR (Token Generation Rate) means the model can produce more tokens per second, boosting overall inference throughput.
- In the same amount of time, more inference tasks can be completed, or larger models can be served efficiently.
3️⃣ RoCE Architecture Is Better Suited for Inference Workloads
- The results demonstrate that Asterfusion’s RoCE switch is better optimized for AI inference traffic in multi-GPU, multi-node deployments.
- Compared to traditional InfiniBand, RoCE offers greater flexibility and superior cost-performance.
4️⃣ A More Efficient Network = Greater Business Value
- Low latency + high throughput means shorter response times and greater concurrency—enabling more to be done with the same hardware.
- For AI service providers, this directly translates to lower costs, higher revenue, and better scalability.
Among all performance metrics, TGR (Token Generation Rate) is especially crucial. It directly reflects the system’s output efficiency and business value. The higher the TGR, the more tokens an AI provider can generate—and monetize—per second. Here’s why that matters:
Why Is TGR a Core Indicator of AI Profitability?
1️⃣ Higher Request Concurrency
- A higher TGR allows more tokens to be generated per second, meaning more user requests can be processed simultaneously—boosting service capacity.
2️⃣ Shorter Task Times, Better Resource Utilization
- With a high TGR, each inference task completes faster, freeing up server resources more quickly.
- This allows the same hardware to handle more users or tasks, significantly improving ROI (Return on Investment).
3️⃣ Lower Compute Costs, Higher Profit Margins
- Systems with high TGR can achieve the same scale of inference with fewer servers, reducing power, rental, and maintenance costs.
- With better efficiency and lower overhead, profitability increases naturally.
4️⃣ Better User Experience and Retention
- Faster output means more responsive interaction, which enhances user satisfaction.
- A smoother experience encourages customer loyalty and subscription renewals.
In a single sentence: The higher the TGR, the more tokens an AI provider can generate per unit of time—essentially boosting AI productivity. This means completing more orders at lower cost, ultimately increasing profit potential.
Asterfusion 800G AI RoCE Switch: A Breakthrough in Both Performance and Cost
- Unmatched Performance: With higher TGR and lower P90 ITL, it supercharges inference speed, responsiveness, and system throughput.
- Game-Changing Cost Advantage: At just one-third the price of traditional InfiniBand solutions, it slashes infrastructure costs without compromising performance.
- Built for Scalable AI: Faster responses, lower latency, and industry-leading cost-efficiency empower AI providers to scale faster, serve more users, and do more with less.
✅ Bottom Line: The Asterfusion RoCE switch delivers exceptional performance at a fraction of the cost—boosting inference productivity, maximizing ROI, and giving AI service providers a clear edge in the race to commercial success.
For more on the comparison between InfiniBand and RoCE, see:
RoCE or InfiniBand? The Most Comprehensive Technical Comparison (I)
RoCE or InfiniBand? The Most Comprehensive Technical Comparison (II)