RoCE or InfiniBand? The Most Comprehensive Technical Comparison (I)
written by Asterfusion
RoCE (RDMA over Converged Ethernet) and InfiniBand are both advanced network protocol stacks developed by the InfiniBand Trade Association (IBTA). They are designed to provide high bandwidth, low latency, and reliable communication. While RoCE operates over Ethernet, InfiniBand utilizes its own fabric. Both protocols support reliable message transmission and memory operation semantics, including Remote Direct Memory Access (RDMA).
RoCE & InfiniBand: Their Evolution and Current Landscape
Founded in 1999, the IBTA initially established the InfiniBand Architecture (IBA) specification, which allows for data movement without software intervention. InfiniBand has been widely adopted in data centers, especially in High-Performance Computing (HPC) and Artificial Intelligence (AI) clusters, effectively scaling to thousands of nodes. The introduction of RoCE in 2010 aimed to bring RDMA to a broader audience by merging InfiniBand's efficiency with Ethernet's widespread use.
RoCE retains the transport layer and RDMA protocol from InfiniBand, despite differences in the link and network layers. This convergence has led to significant adoption in AI and HPC environments, with some clusters scaling up to 100,000 nodes. According to recent statistics, InfiniBand accounts for 47.8% of systems in the world's TOP500 supercomputers, while RoCE holds 39%. In terms of total port bandwidth, however, RoCE leads with 48.5%, compared to InfiniBand's 39.2%.
In this three-part series, we will delve into a detailed comparison of RoCE and InfiniBand, focusing on the following:
- Physical Layer
- Link Layer
- Network Layer
- Transport Layer
- Congestion Control
- QoS
- ECMP
By the end of this exploration, readers will gain a comprehensive understanding of both technologies and their respective advantages in modern data center architectures.
In this article, we will focus on comparing RoCE and InfiniBand specifically from the perspectives of the Physical Layer, Link Layer, Network Layer, and Transport Layer. Understanding these layers is crucial for grasping how each technology operates and their respective advantages in networking environments.
1.1 Protocol Stack
RoCE is divided into two versions: v1 and v2. The comparison between them and the IB protocol stack is as follows:
In RoCEv1, Ethernet replaces the link layer of IB. In RoCEv2, IP additionally replaces the network layer of IB, which is why it is also called IP-routable RoCE. Above the transport layer, the three protocol stacks are identical.
1.2 Message Format
- The IB message format is shown in the figure below.
Within a subnet, only the Local Routing Header (LRH) is used, which corresponds to the link layer in the OSI model.
Between subnets, there is also a Global Routing Header (GRH), which corresponds to the network layer in the OSI model.
Above the Routing Header is the Transport Header, representing the transport layer.
Finally, there are two Cyclic Redundancy Checks: an Invariant CRC (ICRC) covering the fields that do not change in transit, and a Variant CRC (VCRC) covering the entire packet.
- The RoCE message format is shown in the figure below.
Here, RoCEv1 reuses IB's Global Routing Header; IB BTH is IB's Base Transport Header; ICRC is a cyclic redundancy check covering the fields that do not change as the packet traverses the InfiniBand layers; and FCS is the Frame Check Sequence of the Ethernet link layer.
1.3 Physical Layer
- RoCE’s physical layer is based on standard Ethernet, using PAM4 (4-level Pulse Amplitude Modulation) at higher speeds together with 64b/66b line coding. It supports both copper cables and optical fibers, with interfaces such as SFP+, QSFP+, and OSFP, and rates ranging from 10GbE to 800GbE.
- InfiniBand’s (IB) physical layer is proprietary, traditionally using NRZ (Non-Return-to-Zero) modulation with 64b/66b line coding. It also supports copper cables and optical fibers, with interfaces typically being QSFP and OSFP, and rates ranging from 10Gbps to 400Gbps. Higher total bandwidths (such as 800Gbps) can be achieved by aggregating multiple lanes.
PAM4 utilizes four distinct voltage levels to represent data. Compared to NRZ’s binary modulation (which uses only two levels per symbol), PAM4 transmits two bits of data per symbol cycle, effectively doubling the bandwidth efficiency. This makes PAM4 advantageous in supporting higher rates (such as 1.6T, 3.2T), offering potential benefits for future ultra-high-speed data transmission.
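The doubling described above follows directly from the bits-per-symbol arithmetic. A minimal sketch (the 53.125 GBd symbol rate is an illustrative figure typical of modern 100G-per-lane SerDes, not a value from this article):

```python
import math

def bits_per_symbol(levels: int) -> int:
    """Bits carried by one symbol given the number of amplitude levels."""
    return int(math.log2(levels))

def lane_rate_gbps(baud_gbaud: float, levels: int) -> float:
    """Raw per-lane bit rate for a given symbol rate (GBd) and modulation."""
    return baud_gbaud * bits_per_symbol(levels)

# NRZ: 2 levels -> 1 bit/symbol; PAM4: 4 levels -> 2 bits/symbol.
# At the same symbol rate, PAM4 doubles the raw lane rate.
print(lane_rate_gbps(53.125, 2))   # NRZ
print(lane_rate_gbps(53.125, 4))   # PAM4: exactly twice the NRZ rate
```

This is why PAM4 can push per-lane rates higher without requiring faster (and harder to engineer) symbol clocks.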
At the physical layer, both RoCE and IB support 800G. However, PAM4 is the superior modulation compared to NRZ, and Ethernet is the more cost-effective ecosystem, making RoCE the clear winner.
1.4 Link Layer
- The link layer of RoCE is standard Ethernet. To achieve lossless transmission on traditional Ethernet, PFC (Priority-based Flow Control), defined in IEEE 802.1Qbb, is introduced. When a priority queue's buffer on a switch is close to full, the switch sends a PFC frame to the upstream device, telling it to pause traffic of that priority; this prevents buffer overflow and avoids packets being dropped at the link layer. In addition, Ethernet introduces ETS (Enhanced Transmission Selection), part of the DCB (Data Center Bridging) standards and defined in IEEE 802.1Qaz. ETS distributes traffic across different queues, assigns each queue a weight that controls the share of link bandwidth it can use, and thereby ensures that high-priority traffic such as RDMA obtains sufficient bandwidth resources.
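The bandwidth guarantees that ETS weights imply can be sketched with a little arithmetic. The traffic-class names and weights below are hypothetical, chosen only to illustrate the weighted split (real devices also redistribute unused share to active queues):

```python
def ets_shares(link_gbps: float, weights: dict[str, int]) -> dict[str, float]:
    """Minimum bandwidth guaranteed to each traffic class under ETS-style
    weighted scheduling. Each class gets link * weight / total_weight."""
    total = sum(weights.values())
    return {tc: link_gbps * w / total for tc, w in weights.items()}

# Hypothetical weights on a 400G link: RDMA gets the largest guaranteed share.
shares = ets_shares(400, {"rdma": 50, "storage": 30, "best_effort": 20})
print(shares)  # {'rdma': 200.0, 'storage': 120.0, 'best_effort': 80.0}
```

When a class is idle, real ETS schedulers lend its share to the busy classes, so these numbers are floors rather than hard caps.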
- The link layer of IB is proprietary, and the packet header is called Local Routing Header, as shown in the figure.
VL refers to Virtual Lanes, SL stands for Service Level, and the Source/Destination Local Identifiers (SLID/DLID) are the link-layer addresses. IB has built-in support for lossless transmission because it implements credit-based flow control: the receiver advertises a credit value on each link indicating how much data its buffer can accept, and the sender transmits only within that credit, never exceeding the receiver's capacity, thereby avoiding buffer overflow and data loss.
The IB link layer integrates SL and VL to achieve QoS (Quality of Service). SL has 16 service levels used to identify traffic priorities, and each data packet can be assigned to different service levels based on business needs. Through SL-VL mapping, traffic of different priorities is allocated to different VLs, ensuring that high-priority traffic (such as RDMA) is not affected by congestion from low-priority traffic.
The IB link layer is implemented in dedicated hardware, making it highly efficient with ultra-low latency. RoCE, based on standard Ethernet hardware, has slightly higher latency. However, since both operate at the 100 ns level while the end-to-end latency budget for RDMA transfers is typically around 10 µs, the difference in practice is small.
At the link layer, both achieve lossless transmission. RoCE's ETS provides fine-grained bandwidth guarantees for traffic of different priorities, while IB switches have lower forwarding latency, giving IB the advantage here.
1.5 Network Layer
- RoCE uses IP at the network layer, which can be either IPv4 or IPv6. It leverages mature routing protocols like BGP and OSPF, making it adaptable to any network topology with fast self-healing capabilities. It supports ECN (Explicit Congestion Notification) for end-to-end congestion control and DSCP as a replacement for IB’s Traffic Class to implement QoS.
- InfiniBand’s (IB) network layer is modeled on IPv6: the Global Routing Header has essentially the same format as the IPv6 header, with 128-bit addresses, though the field names differ. However, IB does not define routing protocols; instead, routing is handled by a Subnet Manager, a centralized service that discovers the network topology and computes routes. When crossing subnets, routers handle address mapping and path selection, with the exact method left to each vendor, so interoperability is lacking.
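The Subnet Manager's core job of computing routes from a discovered topology can be sketched with a breadth-first search. This is an illustrative simplification; real SMs such as OpenSM use richer algorithms (min-hop, fat-tree, up/down routing), and the node names below are hypothetical:

```python
from collections import deque

def compute_routes(topology: dict[str, list[str]], root: str) -> dict[str, str]:
    """BFS sketch of centralized route computation: for every node, record
    the next hop on a shortest path back toward `root`."""
    next_hop: dict[str, str] = {}
    queue = deque([root])
    visited = {root}
    while queue:
        node = queue.popleft()
        for neighbor in topology[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                next_hop[neighbor] = node  # first hop toward root
                queue.append(neighbor)
    return next_hop

# Hypothetical subnet: host_a - sw1 - sw2 - host_b
topo = {"host_a": ["sw1"], "sw1": ["host_a", "sw2"],
        "sw2": ["sw1", "host_b"], "host_b": ["sw2"]}
print(compute_routes(topo, "host_a"))
```

The key contrast with RoCE is architectural: here one central entity computes all forwarding state, whereas IP networks converge on routes through distributed protocols like OSPF and BGP.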
Clearly, RoCE’s network layer is widely used across billions of devices on the internet, with mature technology that continues to evolve. With the introduction of technologies like SRv6, IP further enhances capabilities in traffic engineering, service chaining, flexibility, and scalability, making it ideal for building large-scale, self-healing RDMA networks.
At the network layer, RoCE is the clear choice for large-scale networks. It leverages the maturity of IP, which continues to evolve, offering more advantages than IB.
1.6 Transport Layer
- 1.6.1 RoCE
RoCE adopts IB’s transport layer. Although the RoCEv2 protocol stack includes UDP, it only borrows UDP's encapsulation format; connection management, retransmission, and congestion control are handled by the IB transport layer. The UDP destination port is fixed (4791 for RoCEv2), while the source port is dynamically assigned but remains fixed for the lifetime of a connection. This allows network devices to distinguish different RDMA data flows by their source port.
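The fixed destination port plus per-flow source port can be sketched as follows. The destination port 4791 is the IANA-assigned RoCEv2 port; the `flow_src_port` hash is a hypothetical illustration of how a NIC might derive flow entropy from the QP pair, not a specified algorithm:

```python
import struct
import zlib

ROCEV2_UDP_DPORT = 4791  # IANA-assigned destination port for RoCEv2

def udp_header(src_port: int, length: int, checksum: int = 0) -> bytes:
    """Build the 8-byte UDP header that encapsulates a RoCEv2 packet.
    The destination port is fixed; the source port carries flow entropy."""
    return struct.pack("!HHHH", src_port, ROCEV2_UDP_DPORT, length, checksum)

def flow_src_port(local_qp: int, remote_qp: int) -> int:
    """Hypothetical per-connection source port: hash the QP pair into the
    ephemeral range so ECMP can spread distinct RDMA flows across paths."""
    return 49152 + zlib.crc32(struct.pack("!II", local_qp, remote_qp)) % 16384

hdr = udp_header(flow_src_port(0x12, 0x34), length=8 + 12)  # UDP + BTH
print(struct.unpack("!HHHH", hdr)[1])  # destination port -> 4791
```

Because the source port is stable for a connection but differs between connections, ordinary L3/L4 hashing in switches keeps each RDMA flow on one path while balancing flows across paths.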
- 1.6.2 InfiniBand (IB)
The IB transport layer uses a modular and flexible design, consisting of a Base Transport Header (BTH) followed by zero or more Extended Transport Headers (ETH).
- BTH (Base Transport Header) is part of the IB transport layer header. It represents the basic header at Layer 4 (L4) of the IB network protocol and is used to describe control information for data transmission. Its format is:
Key information includes:
- OpCode (Operation Code): Composed of 8 bits. The first 3 bits represent the transport service type, such as reliable connection, unreliable connection, reliable datagram, unreliable datagram, and RAW datagram. The remaining 5 bits indicate the operation type, such as SEND, READ, WRITE, ACK, etc.
- Destination QP (Queue Pair Number): Similar to a TCP port number, it represents the destination of an RDMA connection (referred to as a Channel). Unlike TCP ports, the QP consists of both Send and Recv queues but is identified by a single number.
- Packet Sequence Number (PSN): Similar to the TCP sequence number, it is used to check the order of transmitted packets.
- Partition Key: Used to divide an RDMA network into multiple logical partitions. In RoCE, newer technologies like VxLAN can be used as a replacement.
- ECN (Explicit Congestion Notification): Used for congestion control and includes two bits, Forward and Backward, which indicate congestion encountered on the forward and return paths, respectively. In RoCE, this is replaced by the ECN field in the IP header.
BTH helps the receiver understand which connection the packet belongs to and how to process it, including verifying packet order and identifying operation types.
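The field split described above (3-bit service type plus 5-bit operation in the OpCode, 24-bit DestQP and PSN) can be sketched as a parser over the 12-byte BTH. The byte layout follows the IBTA field order; the sample values (QP number, PSN) are hypothetical:

```python
import struct

def parse_bth(bth: bytes) -> dict:
    """Decode the 12-byte IB Base Transport Header: opcode, flags,
    P_Key, 24-bit destination QP, and 24-bit PSN."""
    opcode, flags, pkey, qp_word, psn_word = struct.unpack("!BBHII", bth)
    return {
        "service_type": opcode >> 5,      # top 3 bits: RC/UC/RD/UD/RAW
        "operation": opcode & 0x1F,       # low 5 bits: SEND/WRITE/READ/ACK...
        "pkey": pkey,
        "dest_qp": qp_word & 0x00FFFFFF,  # 24-bit destination QP number
        "psn": psn_word & 0x00FFFFFF,     # 24-bit packet sequence number
    }

# Opcode 0x0A = RC (service type 0) RDMA WRITE Only; QP and PSN are made up.
bth = struct.pack("!BBHII", 0x0A, 0, 0xFFFF, 0x1D, 7)
print(parse_bth(bth))
```

Note how the high bits of the opcode select the transport service and the low bits the operation, exactly the split the OpCode bullet above describes.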
After the BTH comes the RDMA Extended Transport Header, which contains information like the remote virtual address, key, and data length. Its format is:
- Virtual Address: Represents the memory address in the destination.
- DMA Length: The length of the data to be read or written, in bytes.
- Remote Key: A key used to access remote memory.
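The three RETH fields map onto a fixed 16-byte layout: a 64-bit virtual address, a 32-bit remote key, and a 32-bit DMA length. A minimal pack/unpack sketch (the address, key, and length values are hypothetical):

```python
import struct

def pack_reth(virtual_address: int, remote_key: int, dma_length: int) -> bytes:
    """Build the 16-byte RDMA Extended Transport Header:
    64-bit virtual address, 32-bit remote key, 32-bit DMA length."""
    return struct.pack("!QII", virtual_address, remote_key, dma_length)

reth = pack_reth(0x7F0000001000, remote_key=0xCAFE, dma_length=4096)
print(len(reth))                  # 16 bytes
print(struct.unpack("!QII", reth))
```

These three values are everything the target NIC needs to perform the DMA on behalf of the remote peer, which is why one-sided operations need no CPU involvement at the target.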
The IB transport layer is usually implemented by RDMA NIC hardware, called a Channel Adapter (CA) in IB and a RoCE NIC in RoCE, enhancing RDMA transmission performance. In some advanced RoCE switches, IB transport layer information can also be recognized and used to accelerate RDMA data flow processing.
From the transport layer up, RoCE and IB use the same protocols, so there is no difference between them.
1.7 RDMA Operations
With the help of the RDMA Extended Transport Header, RoCE and IB’s transport layers enable direct read and write operations on the remote host’s address.
- RDMA Write Operation
Once a Queue Pair (QP) is established, RDMA Write can be performed directly, allowing the sender to write directly to the receiver’s memory without involving the receiver’s CPU or requiring any request. This is one of the core features of RDMA that ensures high performance and low latency.
RDMA Write is a unidirectional operation. After writing the data, the sender does not need to wait for a response from the receiver. Unlike the traditional Send/Receive model, it does not require the receiver to prepare a receive queue in advance.
- RDMA Read Operation
RDMA Read allows the sender to read data directly from the receiver’s memory without involving the receiver’s CPU. The target address and data size are specified by the sender. As shown in the figure below, a single request can result in multiple responses to return the data, improving data transmission efficiency.
- Send/Receive Operation
This is the traditional message-passing operation, where data is transmitted from the sender to the receiver’s receive queue, requiring the receiver to prepare the queue in advance.
In RoCE, RDMA bypasses the operating system’s TCP/IP stack and connects directly to the transport layer on the RoCE NIC. Using the DMA mechanism, it can access both local and remote memory directly, achieving zero-copy transmission, significantly improving performance.
Similarly, IB NICs implement RDMA operations in hardware, providing zero-copy transmission, with both technologies offering comparable performance.
However, in both RoCE and IB, the initialization of RDMA connections, resource allocation, Queue Pair (QP) management, and certain control path operations (such as connection establishment and memory registration) still rely on the software stack.
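The one-sided semantics above can be modeled in miniature: once memory is registered and an rkey issued, a peer presenting that key can read or write the buffer with no involvement from the target's CPU or receive queue. This is a conceptual toy, not the verbs API; the class, the rkey value, and the byte-addressed buffer are all illustrative assumptions:

```python
class RemoteMemory:
    """Toy model of a registered RDMA memory region. The NIC-side checks
    are reduced to a single rkey comparison."""

    def __init__(self, size: int):
        self.buf = bytearray(size)
        self.rkey = 0xBEEF            # hypothetical key issued at registration

    def rdma_write(self, rkey: int, addr: int, data: bytes) -> None:
        """One-sided write: no receive queue, no target CPU involvement."""
        if rkey != self.rkey:
            raise PermissionError("invalid rkey")  # NIC rejects the access
        self.buf[addr:addr + len(data)] = data

    def rdma_read(self, rkey: int, addr: int, length: int) -> bytes:
        """One-sided read: address and length are chosen by the initiator."""
        if rkey != self.rkey:
            raise PermissionError("invalid rkey")
        return bytes(self.buf[addr:addr + length])

mem = RemoteMemory(64)
mem.rdma_write(0xBEEF, addr=8, data=b"hello")
print(mem.rdma_read(0xBEEF, addr=8, length=5))  # b'hello'
```

Contrast this with Send/Receive, where the target must have posted a receive buffer in advance; here the initiator alone decides where and how much to read or write, gated only by the rkey.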
1.8 Application Layer
RDMA has seen widespread use in data centers, HPC clusters, and supercomputers, where it supports critical internal data center operations such as AI training, inference, and distributed storage.
For example, during AI training/inference, libraries like xCCL or MPI use RDMA for point-to-point and collective communication. In distributed storage, technologies like NVMe-oF and Ceph use RDMA to perform read and write operations on networked storage.
We will stop here for today. The comparison continues in the next article: RoCE or InfiniBand? The Most Comprehensive Technical Comparison (II).