
RoCE vs. InfiniBand: Shocking Data Center Switch Test Results You Need to See!

Written by Asterfusion

January 29, 2025

In AI, HPC, and other high-performance lossless networks, people often discuss RoCE (RDMA over Converged Ethernet) and InfiniBand (IB). We have previously detailed the differences and relationships between these two protocol stacks (see: Comparison and Analysis of RoCE vs. IB Protocol Stacks). In short, RoCE adapts the mature InfiniBand transport layer and RDMA semantics to Ethernet and IP networks.

InfiniBand was the first protocol to support RDMA. It has a head start, a mature technology stack, and a full suite of specialized hardware and software, and it delivers very low-latency transmission. However, its reliance on a single vendor often leads to a higher Total Cost of Ownership (TCO).

RoCEv2, on the other hand, is cheaper and more interoperable, which makes it better suited to large-scale deployments. For instance, xAI’s AI cluster in Memphis, USA, comprises over 100,000 GPU cards and uses a 400GbE Ethernet-based lossless high-speed network.

Can Open Network Switch Replace InfiniBand?

Quoting Amazon Senior Principal Engineer Brian Barrett, the reasons why AWS abandoned the InfiniBand solution are clear:

“A dedicated InfiniBand network cluster is like an isolated island in a vast ocean; it cannot meet the needs of resource scheduling, sharing, and other elastic deployment requirements.”

The industry is shifting toward open standards and multi-vendor compatibility, and open network operating systems such as SONiC have already been deployed in large cloud environments. This trend is hard to ignore, and it raises an important question: can open networks running RoCE be a viable alternative to InfiniBand in high-performance AI/HPC scenarios?

Moreover, could the flexibility of open architectures provide broader benefits to data center operators? For instance, could it simplify the complex setup and tuning of RoCE networks? Could it enhance operations and diagnostics? And could it open up even more possibilities?

Test Background

We tested SONiC-based RoCE switches against InfiniBand (IB) switches in three scenarios: AI training, HPC, and distributed storage. The results are summarized below, and the testing methods are briefly outlined in the appendix to provide useful reference points for our readers.

  • AI Intelligent Computing Scenario: E2E forwarding tests, NCCL-TEST, and large model training network tests
  • HPC Scenario: E2E forwarding performance, MPI, Linpack, and HPC applications (WRF, LAMMPS, VASP)
  • Distributed Storage: FIO tool used to measure read and write performance under stress

The RoCE switch tested is the Asterfusion CX-N series, which features an ultra-low latency hardware platform and is equipped with the enterprise-grade SONiC distribution AsterNOS. All ports support RoCEv2 natively and are paired with the EasyRoCE Toolkit.

CX-N series model specifications are shown in the following table:

Model type    Interfaces
CX864E-N      64 x 800GE OSFP, 2 x 10GE SFP+
CX732Q-N      32 x 400GE QSFP-DD, 2 x 10GE SFP+
CX664D-N      64 x 200GE QSFP56, 2 x 10GE SFP+
CX564P-N      64 x 100GE QSFP28, 2 x 10GE SFP+
CX532P-N      32 x 100GE QSFP28, 2 x 10GE SFP+

(The 10GE interfaces can only be used for INT.)

The network operating system architecture installed on the switch:

[Figure: AsterNOS feature architecture]

Test Conclusion

The open architecture Asterfusion CX-N series switch (RoCE) provides end-to-end performance that matches InfiniBand (IB) switches, and in some cases, even outperforms them.

AI Intelligent Computing Scenario

  1. The E2E forwarding bandwidth matches the NIC’s maximum rate, with forwarding latency through a single switch as low as 560 ns.
  2. In the NCCL-test (Ring algorithm) with two machines and 16 cards, the maximum total bus bandwidth measured through the two tested switches is nearly identical to that of the IB switch (around 195 GB/s). Bandwidth utilization matches the NIC direct connection, reaching the servers’ maximum scale-out speed.
  3. With an optimized topology, the training time for Llama2-7B (2048 token length) on two machines with 16 cards matches the results from both NIC direct connection and IB networking.

HPC Scenario

  1. The E2E latency performance is almost identical to that of the IB switch, with differences staying within the nanosecond range.
  2. In the MPI benchmark test, E2E performance closely matches the IB switch, with latency differences also in the nanosecond range.
  3. Linpack efficiency is nearly the same as when using an IB switch of the same specification, with differences of around 0.2%.
  4. When running WRF, LAMMPS, and VASP applications in parallel within an HPC cluster, the average time to complete the same computational tasks using the RoCE switch is comparable to that of the IB switch, with differences ranging from 0.5% to 3%.
[Figures: WRF (Weather Research and Forecasting) and LAMMPS (Large-scale Atomic/Molecular Massively Parallel Simulator) results]

Distributed Storage Scenario

The Input/Output Operations Per Second (IOPS) of the distributed storage system using a RoCE network is comparable to that of the same-spec InfiniBand network, and under certain conditions it even outperforms InfiniBand.

[Figure: RoCE switch vs. IB switch in the distributed storage scenario]

Appendix: Testing Method and Main Steps

AI Intelligent Computing Scenario

The network is set up according to the topology shown below. The Mellanox MLNX_OFED driver is installed on the servers; its installation automatically includes the IB testing tool suite (ib_read_lat, ib_send_lat, ib_write_lat, and so on). Using this tool suite, Server 1 acts as the sender and Server 0 as the receiver, and the tests measure forwarding bandwidth, latency, and packet loss through the switches.

  • The switch port forwarding latency is calculated as the difference between the total communication latency and the NIC direct connection forwarding latency.
  • NCCL-test is run using the mpirun command with all_reduce_perf. The measured bus bandwidth increases with the data volume, and the maximum value observed is used as the test result.
[Figure: AI intelligent computing scenario, SONiC switch networking topology]
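
To make the method concrete, here is a minimal driver sketch for these two measurements. It assumes the perftest tools (ib_write_lat) and the nccl-tests binary all_reduce_perf are installed on both servers with passwordless SSH between them; the hostnames, RDMA device name, and GPU counts are placeholders rather than the actual lab parameters.

```python
"""Minimal sketch of the E2E latency and NCCL bus-bandwidth measurements.

Assumed placeholders (not the actual lab setup): two servers named server0
and server1, RDMA NICs exposed as mlx5_0, passwordless SSH, and the perftest
and nccl-tests suites installed on both machines.
"""
import subprocess
import time

NIC = "mlx5_0"        # RDMA device name (placeholder)
RECEIVER = "server0"  # runs the ib_write_lat server side
SENDER = "server1"    # runs the ib_write_lat client side

def e2e_latency():
    # Start the receiver first, give it a moment, then run the sender against it.
    recv = subprocess.Popen(["ssh", RECEIVER, f"ib_write_lat -d {NIC} -a"])
    time.sleep(2)
    send = subprocess.run(["ssh", SENDER, f"ib_write_lat -d {NIC} -a {RECEIVER}"],
                          capture_output=True, text=True, check=True)
    recv.wait()
    print(send.stdout)  # per-message-size latency report

def nccl_allreduce():
    # Two machines x 8 GPUs with the Ring algorithm; bus bandwidth grows with
    # message size, and the peak value is taken as the result.
    subprocess.run(["mpirun", "-np", "16", "-H", f"{RECEIVER}:8,{SENDER}:8",
                    "-x", "NCCL_ALGO=Ring",
                    "./all_reduce_perf", "-b", "8", "-e", "8G", "-f", "2", "-g", "1"],
                   check=True)

if __name__ == "__main__":
    e2e_latency()
    nccl_allreduce()
```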

Enabling GPU Direct RDMA

Load the nv_peer_mem kernel module.

Manual compilation method:
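
A hedged sketch of the manual build and load steps is shown below, assuming the module is built from the Mellanox nv_peer_memory source tree; the repository URL and commands are the commonly documented ones, not necessarily the exact procedure used in this test.

```python
"""Sketch of manually building and loading the nv_peer_mem kernel module.

Assumes the Mellanox nv_peer_memory sources, matching kernel headers, and
root privileges; the URL and commands are the commonly documented ones.
"""
import subprocess

def sh(cmd, cwd=None):
    subprocess.run(cmd, shell=True, cwd=cwd, check=True)

# Fetch and build the module from source.
sh("git clone https://github.com/Mellanox/nv_peer_memory.git")
sh("make", cwd="nv_peer_memory")

# Install it, load it, and confirm it is present.
sh("sudo make install", cwd="nv_peer_memory")
sh("sudo modprobe nv_peer_mem")
sh("lsmod | grep nv_peer_mem")
```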

ACSCtl check

Disable the PCIe ACS configuration; if ACS is left enabled, NCCL will not reach the expected peak bandwidth.
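
The following is a minimal sketch of how ACS can be checked and cleared with lspci and setpci; it assumes root privileges and simply targets every PCIe bridge that reports ACS as enabled, which may be broader than the tuning done in the actual test.

```python
"""Sketch: find PCIe bridges with ACS enabled and clear their ACS control register.

Assumes lspci/setpci (pciutils) and root privileges; the change does not
persist across reboots.
"""
import re
import subprocess

def run(cmd):
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

# ACS lives on PCI bridge devices (root ports / downstream ports).
bridges = [line.split()[0] for line in run(["lspci"]).splitlines()
           if "PCI bridge" in line]

for bdf in bridges:
    details = run(["lspci", "-s", bdf, "-vvv"])
    if re.search(r"ACSCtl:.*SrcValid\+", details):
        # Write zero to the ACS control register of this port.
        subprocess.run(["setpci", "-s", bdf, "ECAP_ACS+0x6.w=0000"], check=True)
        print(f"ACS disabled on {bdf}")
```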

RDMA-related configuration

Loading the RDMA SHARP acceleration library
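
As an illustration only, the sketch below shows the kind of NCCL environment settings typically applied for RoCEv2 traffic and for picking up an RDMA/SHARP network plugin; the HCA names, GID index, interface name, plugin path, and host list are assumptions, not the values used in this test.

```python
"""Sketch of RDMA-related environment settings for an NCCL job over RoCEv2.

All values (HCA names, GID index, interface name, plugin path, host list)
are illustrative placeholders rather than the settings used in this test.
"""
import subprocess

nccl_env = {
    "NCCL_IB_HCA": "mlx5_0,mlx5_1",   # RDMA NICs to use
    "NCCL_IB_GID_INDEX": "3",         # GID entry mapping to RoCEv2 (setup dependent)
    "NCCL_SOCKET_IFNAME": "ens1f0",   # interface for NCCL bootstrap traffic
    # Directory holding libnccl-net.so from the RDMA/SHARP plugin package,
    # so NCCL can dlopen the plugin at startup (path is an assumption).
    "LD_LIBRARY_PATH": "/opt/nccl-rdma-sharp-plugins/lib",
}

# Forward the settings to every rank with OpenMPI's -x option, then launch
# the collective benchmark.
cmd = ["mpirun", "-np", "16", "-H", "server0:8,server1:8"]
for key, value in nccl_env.items():
    cmd += ["-x", f"{key}={value}"]
cmd += ["./all_reduce_perf", "-b", "8", "-e", "8G", "-f", "2"]
subprocess.run(cmd, check=True)
```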

HPC Scenario

[Figure: HPC scenario, SONiC switch networking topology]

The E2E forwarding performance testing method is similar to the one mentioned above and will not be repeated here.

MPI Benchmark Test Environment Deployment

  • The MPI test is divided into point-to-point communication and network-wide communication, covering message sizes from 2 to 8,388,608 bytes; a minimal run sketch follows below.
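
A minimal sketch of how such a sweep could be driven is shown below, assuming the OSU Micro-Benchmarks (osu_latency, osu_allreduce) are the tool in use, which the test description does not confirm; hostnames and rank counts are placeholders.

```python
"""Sketch: MPI point-to-point and collective sweeps over 2..8,388,608 bytes.

Assumes the OSU Micro-Benchmarks (osu_latency, osu_allreduce) are installed
and on PATH; hostnames and rank counts are placeholders.
"""
import subprocess

HOSTS = "server0,server1"     # placeholder hostnames
SIZES = "2:8388608"           # message-size range in bytes

# Point-to-point: one rank per host, latency swept over the size range.
subprocess.run(["mpirun", "-np", "2", "-H", HOSTS,
                "osu_latency", "-m", SIZES], check=True)

# Collective communication across the whole cluster (allreduce as an example).
subprocess.run(["mpirun", "-np", "16", "-H", "server0:8,server1:8",
                "osu_allreduce", "-m", SIZES], check=True)
```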

HPC Application Deployment

  • The test results represent the time taken by the device under test and the control group to complete the same computational task. The final result is the average of three consecutive runs.

To deploy the WRF (Weather Research and Forecasting) simulation, the server must first have the compiler installed and the basic environment variables configured. Next, third-party libraries such as zlib, libpng, mpich, jasper, and netcdf need to be compiled, followed by testing the dependency libraries. Afterward, the WRF application and WPS (WRF Preprocessing System) are installed. Finally, the executable file is generated.
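
The sketch below illustrates only the environment-variable preparation that precedes the WRF configure and compile steps; the install prefix and library layout are assumptions for illustration and must match wherever the third-party libraries were actually installed.

```python
"""Sketch of the environment preparation that precedes the WRF/WPS build.

The prefix /opt/hpc-libs is an assumed install location for zlib, libpng,
mpich, jasper, and netcdf; adjust it to the real layout.
"""
import os
import subprocess

PREFIX = "/opt/hpc-libs"

env = os.environ.copy()
env.update({
    "NETCDF": PREFIX,                  # netCDF root consumed by the WRF build
    "JASPERLIB": f"{PREFIX}/lib",      # jasper, needed by WPS for GRIB2 input
    "JASPERINC": f"{PREFIX}/include",
    "PATH": f"{PREFIX}/bin:" + env["PATH"],
    "LD_LIBRARY_PATH": f"{PREFIX}/lib:" + env.get("LD_LIBRARY_PATH", ""),
})

# With this environment in place, WRF is built with its own scripts
# (./configure, then ./compile em_real) and WPS is configured afterwards;
# here we only verify that the variables are visible to child processes.
subprocess.run("env | grep -E 'NETCDF|JASPER'", shell=True, env=env, check=True)
```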

Before running LAMMPS, GCC, OpenMPI, and FFTW need to be compiled and installed.
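
Once those dependencies are in place, a LAMMPS job can be launched over MPI roughly as sketched below; the binary name, host list, rank count, and input deck (the in.lj benchmark shipped with LAMMPS) are illustrative rather than the actual workload tested.

```python
"""Sketch: launch a LAMMPS run over MPI once GCC, OpenMPI, and FFTW are built.

The binary name (lmp), host list, rank count, and input deck (the in.lj
benchmark shipped with LAMMPS) are illustrative placeholders.
"""
import subprocess
import time

cmd = ["mpirun", "-np", "64", "-H", "server0:32,server1:32",
       "lmp", "-in", "bench/in.lj"]

# Time the run; the comparison above averages three consecutive runs.
start = time.time()
subprocess.run(cmd, check=True)
print(f"wall time: {time.time() - start:.1f} s")
```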

The control group InfiniBand switch is the Mellanox MSB7800.

Distributed Storage

[Figure: Distributed storage scenario, SONiC switch networking topology]

The FIO tool (Flexible IO Tester) is used to stress-test the distributed storage system. The peak value, calculated from IOPS x IO Size, represents the system’s maximum throughput capacity.

IOPS is measured through random read and random write tests, using different block sizes (4k/8k/1024k).
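
For reference, a minimal sketch of such an FIO sweep is shown below; the target path, queue depth, job count, and runtime are illustrative, and only the block sizes come from the test description.

```python
"""Sketch: FIO random-read/random-write IOPS sweep over several block sizes.

The target path, queue depth, job count, file size, and runtime are
illustrative; only the block sizes (4k/8k/1024k) come from the test
description.
"""
import subprocess

TARGET = "/mnt/dist-storage/testfile"   # placeholder path on the storage cluster

for bs in ("4k", "8k", "1024k"):
    for rw in ("randread", "randwrite"):
        subprocess.run([
            "fio",
            f"--name={rw}-{bs}",
            f"--filename={TARGET}",
            f"--rw={rw}",
            f"--bs={bs}",
            "--ioengine=libaio",
            "--direct=1",
            "--iodepth=32",
            "--numjobs=4",
            "--size=10G",
            "--runtime=60",
            "--time_based",
            "--group_reporting",
        ], check=True)
```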

About Asterfusion

Asterfusion is a leading innovator in open-network solutions, offering cutting-edge products like 1G-800G SONiC-based open network switches, DPU NICs, and P4-programmable hardware. We provide fully integrated, ready-to-deploy solutions for data centers, enterprises, and cloud providers. With our flexible, decoupled hardware and our self-developed enterprise SONiC distribution, we empower customers to build transparent, easy-to-manage, and cost-efficient networks—designed to thrive in an AI-driven future.
