
RoCE Beats InfiniBand: Asterfusion 800G Switch Crushes It with 20% Faster TGR

written by Asterfusion

May 19, 2025

To evaluate real-world AI inference performance, we deployed the DeepSeek large language model on a two-node H20 high-performance server cluster (eight GPUs per node), interconnected via both the Asterfusion ultra-low-latency AI switch CX864E-N and a traditional InfiniBand switch. Across a series of rigorous tests centered on key inference metrics, the CX864E-N demonstrated a clear performance edge: higher throughput (TGR) and significantly lower 90th-percentile inter-token latency (P90 ITL). The result: substantial gains in inference efficiency. Coupled with faster processing, lower network delay, and a cost-performance ratio far superior to InfiniBand, the CX864E-N empowers AI service providers to accelerate large-model deployments at scale while dramatically cutting costs.

Test Overview

This report presents the testing methodology and performance evaluation of the Asterfusion CX864E-N ultra-low-latency AI switch in a DeepSeek 671B model inference scenario. The evaluation covers end-to-end forwarding performance, NCCL communication performance, and network performance under DeepSeek inference workloads.


DeepSeek Inference Cluster Test Network

Network topology

Figure 1 shows the overall topology of the test network, and Figure 2 shows the internal topology of each server. Details follow:

Figure 1 Overall network topology of the test network

Each 800G switch port is fitted with an 800G OSFP optical module and fans out into two 400G connections over two MPO-12 cables, each terminating at a 400G OSFP transceiver on a NIC.

Figure 2 Server internal topology

Figure 1 illustrates the test network connectivity. Two servers, Server 1 (hostname: jytf-d1-311-h20-d05-1) and Server 2 (hostname: jytf-d1-311-h20-d05-2), serve as AI compute nodes. Each server is equipped with eight NVIDIA H20 GPUs and four 400G NVIDIA CX-7 NICs (labeled NIC0–NIC3), connected to the Asterfusion CX864E-N AI switch to form the compute network. The test primarily evaluates the performance of the CX864E-N switch.

Network traffic diagram

Figure 3 gives a simplified view of the traffic flows through the switch:

Figure 3 Network traffic diagram of the switch

Results Overview

Network forwarding and communication performance test

| Device | Test Contents | Test Item | Measured Result | Theoretical Limit | Remark |
|---|---|---|---|---|---|
| Asterfusion CX864E-N AI Switch | E2E forwarding test | Cross-switch forwarding bandwidth | 392.14 Gbps | 400 Gbps | Measured data; theoretical limit derived from the 400G port line rate |
| | | Cross-switch forwarding delay | 560 ns | / | Switch forwarding delay = total delay - NIC direct-connect delay = 2.51 µs - 1.95 µs = 0.56 µs = 560 ns |
| | NCCL test | Single-node NCCL test, node 1 | 469.21 GB/s | / | nccl-tests run on jytf-d1-311-h20-d05-1 |
| | | Single-node NCCL test, node 2 | 469.49 GB/s | / | nccl-tests run on jytf-d1-311-h20-d05-2 |
| | | Cross-switch dual-node all-reduce, NCCL_ALGO=ring | 195.75 GB/s | 200 GB/s | mpirun + nccl-tests; limit = 4 x 400G NICs per node = 200 GB/s |
| | | Cross-switch dual-node all-reduce, NCCL_ALGO=tree | 309.24 GB/s | / | mpirun + nccl-tests |
| | | Cross-switch dual-node all-reduce, NCCL_ALGO=collnet | 309.23 GB/s | / | mpirun + nccl-tests |
| | | Cross-switch dual-node all-to-all, NCCL_ALGO=ring | 29.78 GB/s | / | mpirun + nccl-tests |
| | | Cross-switch dual-node all-to-all, NCCL_ALGO=tree | 29.34 GB/s | / | mpirun + nccl-tests |
| | | Cross-switch dual-node all-to-all, NCCL_ALGO=collnet | 29.73 GB/s | / | mpirun + nccl-tests |
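The report does not include the exact benchmark invocations. As a reference, here is a minimal sketch of how such a dual-node nccl-tests sweep is typically driven, assuming nccl-tests and Open MPI are installed on both hosts with passwordless SSH; the message-size range and the NCCL_IB_HCA binding are illustrative assumptions, not values from the original test.

```python
import subprocess

# Two H20 nodes, eight GPUs each (hostnames from Figure 1)
HOSTS = "jytf-d1-311-h20-d05-1:8,jytf-d1-311-h20-d05-2:8"

def run_nccl(binary: str, algo: str) -> None:
    """Launch one dual-node, 16-rank nccl-tests sweep with a pinned NCCL algorithm."""
    cmd = [
        "mpirun", "-np", "16", "-H", HOSTS,
        "-x", f"NCCL_ALGO={algo}",            # pin ring / tree / collnet
        "-x", "NCCL_IB_HCA=mlx5",             # assumption: bind to the CX-7 RDMA NICs
        binary,
        "-b", "8", "-e", "8G", "-f", "2",     # message sizes from 8 B to 8 GiB, doubling
        "-g", "1",                            # one GPU per MPI rank
    ]
    subprocess.run(cmd, check=True)

for algo in ("ring", "tree", "collnet"):
    run_nccl("all_reduce_perf", algo)         # all-reduce rows of the table above
    run_nccl("alltoall_perf", algo)           # all-to-all rows of the table above
```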

DeepSeek inference scenario network performance test

| Network | Concurrency | TTFT (s) | P90 ITL (s) | TGR (tokens/s) | Char Gen Speed (chars/s) |
|---|---|---|---|---|---|
| RoCEv2 / 400G | 20 | 0.94 | 0.064 | 15.51 | 26.13 |
| IB / 400G | 20 | 0.76 | 0.078 | 14.77 | 27.34 |
| RoCEv2 / 400G | 30 | 0.96 | 0.071 | 14.91 | 25.20 |
| IB / 400G | 30 | 0.84 | 0.085 | 13.83 | 25.56 |
| RoCEv2 / 400G | 50 | 0.89 | 0.090 | 13.17 | 22.08 |
| IB / 400G | 50 | 0.83 | 0.113 | 11.86 | 21.89 |
| RoCEv2 / 400G | 70 | 0.84 | 0.106 | 11.97 | 20.13 |
| IB / 400G | 70 | 0.82 | 0.124 | 10.37 | 19.15 |
| RoCEv2 / 400G | 100 | 0.98 | 0.134 | 10.70 | 17.96 |
| IB / 400G | 100 | 0.95 | 0.152 | 8.39 | 15.52 |

Conclusion: Asterfusion's RoCEv2 network outperforms InfiniBand in throughput and inter-token stability at every concurrency level tested, with only a slight trade-off in first-token latency.

The results are plotted as a bar chart below to make the comparison more intuitive.

Understanding the Key Performance Metrics

Before diving into the test results, it’s important to first understand the core performance metrics evaluated in this study—TTFT, P90 ITL, TGR, and Character Generation Speed. These indicators are crucial in assessing the efficiency and responsiveness of large model inference systems.

1. TTFT (Time To First Token) – First Token Latency

Definition: The time taken from when a client sends a request to when the model generates and returns the first token. A lower TTFT means faster feedback and a smoother user experience.

2. P90 ITL (90th Percentile Inter-Token Latency)

Definition: This measures the time gap between generated tokens during inference. Specifically, the P90 value indicates that 90% of token intervals fall below this threshold. A lower P90 ITL reflects smoother and more stable output with less jitter in response latency.

3. TGR (Token Generation Rate) – Average Token Output Rate

Definition: Indicates how many tokens the model can generate per second (tokens/s). It reflects the overall throughput of the inference system. The higher the TGR, the more efficient and powerful the system is.

Business relevance: TGR is the most important and key productivity metric for AI inference providers. A higher TGR means:

  • More requests served per second;
  • Higher output efficiency;
  • Better resource utilization;
  • Lower operating costs.

4. Character Generation Speed

Definition: The average speed at which characters are generated in the final output, measured in characters per second (chars/s). Compared to token speed, this metric better reflects the actual user-perceived experience—how fast the text appears on screen.
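To make the four definitions concrete, here is a minimal sketch (not part of the original test harness) that computes all of them from one request's per-token arrival timestamps; the function name and sample data are hypothetical.

```python
import numpy as np

def inference_metrics(request_sent, token_times, text):
    """TTFT, P90 ITL, TGR, and character generation speed for a single request.

    request_sent: wall-clock time the request was sent (s)
    token_times:  arrival time of each generated token (s)
    text:         the fully decoded output string
    """
    ttft = token_times[0] - request_sent          # 1. Time To First Token
    gaps = np.diff(token_times)                   # intervals between consecutive tokens
    p90_itl = float(np.percentile(gaps, 90))      # 2. 90% of intervals fall below this
    total = token_times[-1] - request_sent
    tgr = len(token_times) / total                # 3. tokens generated per second
    char_speed = len(text) / total                # 4. visible characters per second
    return ttft, p90_itl, tgr, char_speed

# Hypothetical example: five tokens arriving after a 0.9 s first-token wait
print(inference_metrics(0.0, [0.9, 0.96, 1.03, 1.09, 1.16], "hello world"))
```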

📊 Key Performance Metrics for AI Inference – Comparison Table

| Metric | Full Name | Unit | Description | Represents | Most Relevant To | Higher/Lower is Better |
|---|---|---|---|---|---|---|
| TTFT | Time To First Token | ms | Time from when the request is sent to when the first token is received | Initial response time | End users | Lower |
| ITL | Inter-Token Latency | ms | Average or P90 latency between two consecutive tokens during generation | Smoothness and consistency | End users & system performance | Lower |
| TGR | Token Generation Rate | tokens/s | Number of tokens generated per second | Throughput and inference efficiency | AI service providers | Higher |
| Character Generation Speed | – | chars/s | Average speed of generating visible characters to users | Perceived output speed | End users | Higher |

Test Data Analysis: Why the Asterfusion RoCE Switch Stands Out

The test results clearly show that Asterfusion’s RoCE switch delivers lower P90 ITL and higher TGR compared to traditional InfiniBand (IB) switches. These advantages highlight several key strengths:
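To put numbers on that claim, the short script below recomputes the relative differences directly from the inference table above; it is a sanity check on the published figures, not code from the original test.

```python
# Per concurrency level: (RoCEv2 TGR, IB TGR, RoCEv2 P90 ITL, IB P90 ITL),
# values copied from the DeepSeek inference results table above
results = {
    20:  (15.51, 14.77, 0.064, 0.078),
    30:  (14.91, 13.83, 0.071, 0.085),
    50:  (13.17, 11.86, 0.090, 0.113),
    70:  (11.97, 10.37, 0.106, 0.124),
    100: (10.70,  8.39, 0.134, 0.152),
}

for c, (tgr_roce, tgr_ib, itl_roce, itl_ib) in results.items():
    tgr_gain = (tgr_roce / tgr_ib - 1) * 100      # RoCEv2 throughput advantage
    itl_cut = (1 - itl_roce / itl_ib) * 100       # RoCEv2 inter-token latency reduction
    print(f"concurrency {c:>3}: TGR +{tgr_gain:.1f}%, P90 ITL -{itl_cut:.1f}%")
```

The TGR advantage widens as load increases, from about 5% at 20 concurrent requests to roughly 27% at 100, while P90 ITL stays 12-20% lower throughout.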

1️⃣ Smoother Inference and Better User Experience

  • Lower P90 ITL indicates reduced latency between token generations, resulting in a more stable and fluid output process.
  • For users, this means more natural, uninterrupted AI responses. For the system, it translates into more efficient communication with less waiting between nodes.

2️⃣ Stronger System Throughput

  • Higher TGR (Token Generation Rate) means the model can produce more tokens per second, boosting overall inference throughput.
  • In the same amount of time, more inference tasks can be completed, or larger models can be served efficiently.

3️⃣ RoCE Architecture Is Better Suited for Inference Workloads

  • The results demonstrate that Asterfusion’s RoCE switch is better optimized for AI inference traffic in multi-GPU, multi-node deployments.
  • Compared to traditional InfiniBand, RoCE offers greater flexibility and superior cost-performance.

4️⃣ A More Efficient Network = Greater Business Value

  • Low latency + high throughput means shorter response times and greater concurrency—enabling more to be done with the same hardware.
  • For AI service providers, this directly translates to lower costs, higher revenue, and better scalability.

Among all performance metrics, TGR (Token Generation Rate) is especially crucial. It directly reflects the system’s output efficiency and business value. The higher the TGR, the more tokens an AI provider can generate—and monetize—per second. Here’s why that matters:

Why Is TGR a Core Indicator of AI Profitability?

1️⃣ Higher Request Concurrency

  • A higher TGR allows more tokens to be generated per second, meaning more user requests can be processed simultaneously—boosting service capacity.

2️⃣ Shorter Task Times, Better Resource Utilization

  • With a high TGR, each inference task completes faster, freeing up server resources more quickly.
  • This allows the same hardware to handle more users or tasks, significantly improving ROI (Return on Investment).

3️⃣ Lower Compute Costs, Higher Profit Margins

  • Systems with high TGR can achieve the same scale of inference with fewer servers, reducing power, rental, and maintenance costs.
  • With better efficiency and lower overhead, profitability increases naturally.

4️⃣ Better User Experience and Retention

  • Faster output means more responsive interaction, which enhances user satisfaction.
  • A smoother experience encourages customer loyalty and subscription renewals.

Asterfusion 800G AI RoCE Switch: A Breakthrough in Both Performance and Cost

  • Unmatched Performance: With higher TGR and lower P90 ITL, it supercharges inference speed, responsiveness, and system throughput.
  • Built for Scalable AI: Faster responses, lower latency, and industry-leading cost-efficiency empower AI providers to scale faster, serve more users, and do more with less.

Bottom Line: The Asterfusion RoCE switch delivers exceptional performance at a fraction of the cost—boosting inference productivity, maximizing ROI, and giving AI service providers a clear edge in the race to commercial success.

For more on the comparison between InfiniBand and RoCE:

RoCE or InfiniBand? The Most Comprehensive Technical Comparison (I)

RoCE or InfiniBand? The Most Comprehensive Technical Comparison (II)
