
RoCE or InfiniBand? The Most Comprehensive Technical Comparison (II)

Written by Asterfusion

October 18, 2024

In this article, we continue to compare RoCE and InfiniBand on a number of key aspects, including congestion control, QoS, ECMP, and more. You will find a comprehensive comparison table at the end of the article.

2. Congestion Control

Congestion control is a primary function of the transport layer, although it requires assistance from the link and network layers. While RoCEv2 uses IB’s transport layer protocol, its congestion control differs.

2.1 RoCEv2 Congestion Control Mechanisms

2.1.1 Priority-based Flow Control (PFC)

In RoCEv2, PFC is used to create a lossless Ethernet environment, ensuring RDMA traffic isn’t lost due to link layer congestion.
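
To make the pause/resume behavior concrete, here is a minimal Python sketch of per-priority XOFF/XON watermark logic; the threshold values and the priority number are illustrative assumptions, not defaults from the PFC specification or any vendor.

```python
# Minimal sketch of PFC pause/resume decisions on a single priority queue.
# Watermark values and the priority number are illustrative assumptions.

XOFF_THRESHOLD = 80_000   # bytes buffered before asking the upstream port to pause (assumed)
XON_THRESHOLD = 20_000    # bytes buffered before asking it to resume (assumed)

class PfcQueue:
    def __init__(self, priority: int):
        self.priority = priority
        self.buffered_bytes = 0
        self.paused = False

    def on_enqueue(self, frame_len: int) -> str | None:
        """Called when a frame of this priority is buffered on the ingress port."""
        self.buffered_bytes += frame_len
        if not self.paused and self.buffered_bytes >= XOFF_THRESHOLD:
            self.paused = True
            return f"PAUSE priority {self.priority}"   # only this traffic class is paused
        return None

    def on_dequeue(self, frame_len: int) -> str | None:
        """Called when a frame is drained from the buffer."""
        self.buffered_bytes -= frame_len
        if self.paused and self.buffered_bytes <= XON_THRESHOLD:
            self.paused = False
            return f"RESUME priority {self.priority}"  # upstream may transmit this class again
        return None

q = PfcQueue(priority=3)   # RDMA traffic is often mapped to a dedicated priority
for _ in range(60):
    q.on_enqueue(1500)     # eventually crosses XOFF and triggers a pause
```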

2.1.2 Explicit Congestion Notification (ECN)

ECN is a marking mechanism in the IP header for congestion control, allowing network devices to mark packets during congestion instead of dropping them.

RoCEv2 uses ECN bits to mark congested packets. Upon detecting an ECN mark, the receiver sends a Congestion Notification Packet (CNP) to the sender, which then adjusts its transmission rate using congestion control algorithms like DCQCN.
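
As an illustration of the marking step, the sketch below shows WRED-style probabilistic ECN marking of the kind commonly configured on RoCE switches; the Kmin/Kmax thresholds and maximum marking probability are assumed example values, not recommended settings.

```python
import random

# Illustrative WRED/ECN thresholds (assumed values, not vendor defaults).
KMIN = 100 * 1024      # below this egress queue depth, never mark
KMAX = 400 * 1024      # above this queue depth, always mark
PMAX = 0.2             # marking probability reached at KMAX

CE = 0b11              # Congestion Experienced codepoint in the IP header

def maybe_mark_ecn(queue_depth_bytes: int, ecn_bits: int) -> int:
    """Return the (possibly rewritten) ECN field of a forwarded packet."""
    if ecn_bits not in (0b01, 0b10):      # not ECN-capable (or already CE): leave unchanged
        return ecn_bits
    if queue_depth_bytes < KMIN:
        return ecn_bits                   # no congestion, no mark
    if queue_depth_bytes >= KMAX:
        return CE                         # heavy congestion, always mark
    p = PMAX * (queue_depth_bytes - KMIN) / (KMAX - KMIN)
    return CE if random.random() < p else ecn_bits
```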

2.1.3 Data Center Quantized Congestion Notification (DCQCN)

DCQCN is a congestion control algorithm for RoCEv2 that combines ECN and rate limiting mechanisms, operating at the transport layer. When the receiver detects an ECN mark, it triggers a CNP to the sender, which adjusts its transmission rate to alleviate congestion.

DCQCN is similar in spirit to DCTCP and is typically implemented on RoCE NICs.
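
The sender-side reaction in a DCQCN-like scheme can be sketched as below; the constants (gain, recovery step) are illustrative assumptions and do not reproduce the exact parameters of the published DCQCN algorithm or of any NIC firmware.

```python
# Minimal sketch of DCQCN-style sender rate control.
# Constants are illustrative assumptions, not the published DCQCN defaults.

LINE_RATE_GBPS = 100.0
G = 1 / 16            # gain used to age the congestion estimate alpha
RAI_GBPS = 5.0        # additive-increase step while recovering

class DcqcnSender:
    def __init__(self):
        self.target_rate = LINE_RATE_GBPS   # rate to recover toward
        self.current_rate = LINE_RATE_GBPS  # rate actually used for injection
        self.alpha = 1.0                    # estimate of congestion severity

    def on_cnp(self):
        """Receiver saw CE-marked packets and returned a CNP: cut the rate."""
        self.alpha = (1 - G) * self.alpha + G       # congestion observed: raise alpha
        self.target_rate = self.current_rate
        self.current_rate *= (1 - self.alpha / 2)   # multiplicative decrease

    def on_timer_no_cnp(self):
        """No CNP during the last timer period: decay alpha and recover the rate."""
        self.alpha = (1 - G) * self.alpha
        self.target_rate = min(LINE_RATE_GBPS, self.target_rate + RAI_GBPS)
        self.current_rate = (self.current_rate + self.target_rate) / 2
```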

In summary, PFC, ECN, and DCQCN operate at the data link, network, and transport layers, respectively. In RoCEv2, they are used in combination for more efficient congestion management:

  • PFC: Prevents packet loss at the data link layer, providing lossless transmission and addressing congestion hop by hop on individual link segments.
  • ECN/DCQCN: Allows senders to proactively adjust transmission rates based on congestion marks, reducing network load and addressing end-to-end network issues.

2.2 InfiniBand Congestion Control Mechanisms

InfiniBand’s congestion control mechanism consists of three main parts:

2.2.1 Credit-based Flow Control

IB implements credit-based flow control at the link layer, enabling lossless transmission and forming the basis for InfiniBand’s high performance.
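
The credit exchange can be pictured with the small sketch below: the transmitter only sends when the receiver has advertised buffer credits, so packets are held rather than dropped. The credit unit and buffer size are assumptions for illustration.

```python
# Minimal sketch of link-level credit-based flow control on one virtual lane.
# Credit unit and buffer size are illustrative assumptions.

CREDIT_UNIT_BYTES = 64          # one credit covers this many bytes (assumed)

class CreditedLink:
    def __init__(self, receiver_buffer_bytes: int):
        self.credits = receiver_buffer_bytes // CREDIT_UNIT_BYTES

    def try_send(self, packet_bytes: int) -> bool:
        """Transmit only if enough credits remain; otherwise the packet waits (no drop)."""
        needed = -(-packet_bytes // CREDIT_UNIT_BYTES)   # ceiling division
        if self.credits < needed:
            return False            # hold the packet; lossless by construction
        self.credits -= needed
        return True

    def on_credit_return(self, freed_bytes: int):
        """Receiver freed buffer space and returned credits to the sender."""
        self.credits += freed_bytes // CREDIT_UNIT_BYTES

link = CreditedLink(receiver_buffer_bytes=16 * 1024)
sent = link.try_send(4096)   # True while credits last; False means back-pressure
```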

2.2.2 ECN Mechanism

When switches detect congestion, they set a forward congestion notification mark (the FECN bit) in the packet's transport header; native InfiniBand has no IP header to mark. Upon receiving marked packets, the receiving Channel Adapter (CA) generates a Congestion Notification Packet (CNP) and sends it back to the sender, notifying it of network congestion and the need to reduce its transmission rate.

2.2.3 End-to-End Congestion Control

After receiving a CNP, the sending CA adjusts its transmission rate according to the InfiniBand congestion control algorithm. The sender first reduces its data transmission rate to alleviate congestion, then gradually increases it until another congestion signal is detected. This dynamic adjustment process helps maintain network stability and efficiency. IBA doesn’t define a specific congestion control algorithm; it’s typically customized by vendors.

2.3 Comparison of RoCEv2 and IB Congestion Control Mechanisms

Comparison of the two congestion control mechanisms:

| Layer | RoCEv2 | InfiniBand |
|---|---|---|
| Link Layer | Priority-based Flow Control (PFC) | Credit-based Flow Control |
| Network Layer | ECN/CNP | ECN/CNP |
| Transport Layer | DCQCN | Vendor-specific Congestion Control |

RoCE and IB congestion control mechanisms are fundamentally similar. The main difference is that IB’s mechanism is more integrated, typically provided by a single vendor offering a complete suite of products from NICs to switches. This vendor lock-in results in higher prices. RoCE’s congestion control mechanism is based on open protocols, allowing interoperability between NICs and switches from different vendors, but with more complex configuration and management. Each has its pros and cons.

As large-scale AI training and inference clusters expand, collective communication traffic has led to increasingly severe congestion control problems. This has given rise to new congestion control technologies such as HPCC (High Precision Congestion Control) based on In-band Network Telemetry (INT), and Receiver-driven traffic admission based on Clear-to-Send (CTS). These new technologies are easier to implement on open Ethernet/IP networks.
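
As a rough illustration of receiver-driven admission, the sketch below issues Clear-to-Send (CTS) grants so that the aggregate incast toward one receiver never exceeds a per-round budget; the chunk size and capacity figures are assumptions, not parameters from any published scheme.

```python
from collections import deque

# Minimal sketch of receiver-driven traffic admission with CTS grants.
# Chunk size and per-round admission budget are illustrative assumptions.

CHUNK_BYTES = 64 * 1024
RECEIVER_CAPACITY_BYTES = 4 * CHUNK_BYTES   # data the receiver admits per scheduling round

class CtsReceiver:
    def __init__(self):
        self.pending_requests = deque()      # senders that asked to transmit (RTS)

    def on_request_to_send(self, sender_id: str):
        self.pending_requests.append(sender_id)

    def schedule_round(self) -> list[tuple[str, int]]:
        """Issue CTS grants until this round's admission budget is spent."""
        grants, budget = [], RECEIVER_CAPACITY_BYTES
        while self.pending_requests and budget >= CHUNK_BYTES:
            sender = self.pending_requests.popleft()
            grants.append((sender, CHUNK_BYTES))   # this sender may now transmit one chunk
            budget -= CHUNK_BYTES
        return grants

rx = CtsReceiver()
for s in ("gpu0", "gpu1", "gpu2", "gpu3", "gpu4", "gpu5"):
    rx.on_request_to_send(s)
print(rx.schedule_round())   # first four senders are granted; the rest wait for the next round
```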

3. QoS

In RDMA networks, not only does RDMA traffic need priority guarantees, but some control packets like CNP, INT, and CTS also require special treatment to ensure lossless and prioritized transmission of these control signals.

3.1 RoCEv2 QoS

At the link layer, RoCEv2 uses the Enhanced Transmission Selection (ETS) mechanism to assign different priorities to different traffic types and to provide bandwidth guarantees for each priority.

At the network layer, RoCEv2 uses DSCP (Differentiated Services Code Point) marking combined with priority queuing (PQ), weighted fair queuing (WFQ), and other queuing mechanisms to assign different priorities and bandwidth shares to different traffic types, achieving more fine-grained QoS.
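
The sketch below shows one way such a policy could be expressed: DSCP values map to queues, CNP gets a strict-priority queue, and the remaining queues share bandwidth by weight. The DSCP values, queue names, and weights are illustrative assumptions, not a recommended configuration.

```python
# Minimal sketch of a RoCE-style QoS policy: DSCP -> queue, strict priority for CNP,
# and weighted (ETS/WRR-like) bandwidth sharing for the rest. All values are assumptions.

DSCP_TO_QUEUE = {
    48: "cnp",      # congestion notification packets
    26: "rdma",     # RoCE data traffic
    0:  "default",  # everything else
}

QUEUE_POLICY = {
    "cnp":     {"mode": "strict"},                 # dequeued first so CNPs are never delayed by data
    "rdma":    {"mode": "weighted", "weight": 60}, # ~60% of the remaining bandwidth
    "default": {"mode": "weighted", "weight": 40}, # ~40% of the remaining bandwidth
}

def classify(dscp: int) -> str:
    """Map a packet's DSCP value to an egress queue (unknown DSCPs go to default)."""
    return DSCP_TO_QUEUE.get(dscp, "default")

def share_of_bandwidth(queue: str) -> float:
    """Approximate long-run bandwidth share of a weighted queue when all queues are busy."""
    weighted = {q: p["weight"] for q, p in QUEUE_POLICY.items() if p["mode"] == "weighted"}
    return weighted.get(queue, 0.0) / sum(weighted.values())

print(classify(26), share_of_bandwidth("rdma"))   # 'rdma' 0.6
```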

3.2 InfiniBand QoS

At the link layer, IB uses Service Levels (SL), Virtual Lanes (VL), and the mapping between them to assign high-priority traffic to dedicated VLs for preferential transmission. Although the VL Arbitration Table can influence bandwidth allocation by assigning different weights, this method cannot guarantee bandwidth for each VL.

At the network layer, IB's Global Route Header (GRH) carries an 8-bit Traffic Class field that provides different priorities when crossing subnets, but it similarly cannot guarantee bandwidth.
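
For comparison, the sketch below models an SL-to-VL mapping and a weighted VL arbitration table; the SL/VL assignments and weights are illustrative assumptions, and, as noted above, the weights bias scheduling order rather than guarantee bandwidth.

```python
# Minimal sketch of InfiniBand-style SL -> VL mapping plus a weighted VL arbitration table.
# SL/VL numbers and weights are illustrative assumptions.

SL_TO_VL = {0: 0, 1: 0, 4: 2, 7: 7}      # e.g. SL 7 (control traffic) mapped to its own VL

# The high-priority table is served before the low-priority table; within a table,
# entries are visited round-robin in proportion to their weights.
VL_ARBITRATION_HIGH = [(7, 1)]           # (VL, weight): control traffic on VL 7
VL_ARBITRATION_LOW  = [(2, 200), (0, 55)]

def vl_for_packet(service_level: int) -> int:
    """Pick the virtual lane for a packet based on its SL (unmapped SLs fall back to VL 0)."""
    return SL_TO_VL.get(service_level, 0)

def low_table_share(vl: int) -> float:
    """Approximate scheduling share inside the low-priority table (not a bandwidth guarantee)."""
    total = sum(w for _, w in VL_ARBITRATION_LOW)
    return next((w for v, w in VL_ARBITRATION_LOW if v == vl), 0) / total

print(vl_for_packet(4), round(low_table_share(2), 2))   # 2 0.78
```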

It’s evident that RoCE can provide more fine-grained QoS guarantees and bandwidth control for different traffic types, while InfiniBand leans more towards providing priority scheduling rather than explicit bandwidth guarantees.

4. ECMP

4.1 RoCE ECMP

Data center IP networks typically adopt Spine-Leaf or similar architectures for high reliability and scalability, which usually provide multiple equivalent paths between a pair of RoCE NICs. To achieve load balancing and improve utilization of the network topology, they use ECMP (Equal-Cost Multi-Path) routing. For a given packet, RoCE switches hash certain packet fields to choose among the equivalent paths. Because of reliable-transmission requirements, packets of the same RDMA operation should follow the same path to avoid the out-of-order delivery that path divergence would cause.

In IP networks, protocols like BGP/OSPF can calculate equal-cost paths on any topology; the switch data plane then computes a hash over IP/UDP/TCP header fields (such as the five-tuple) and distributes flows across the available paths. In RoCE networks, the destination QP in the BTH header can additionally be included in the hash to implement more fine-grained, per-QP ECMP.
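
A simple sketch of that path-selection step follows: hash the five-tuple, optionally extended with the destination QP from the BTH, and index into the list of equal-cost next hops. CRC32 stands in for whatever hash a real switch ASIC implements; the next-hop names are hypothetical.

```python
import zlib

# Minimal sketch of ECMP path selection for RoCEv2 traffic.
# Hashing on the five-tuple keeps a flow on one path; adding the destination QP
# from the BTH gives finer-grained spreading across paths.

NEXT_HOPS = ["spine1", "spine2", "spine3", "spine4"]

def ecmp_next_hop(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
                  dest_qp: int | None = None) -> str:
    """Pick an equal-cost next hop; the same inputs always map to the same path."""
    key = f"{src_ip}|{dst_ip}|17|{src_port}|{dst_port}"   # protocol 17 = UDP (RoCEv2)
    if dest_qp is not None:
        key += f"|{dest_qp}"                              # QP-aware ECMP
    return NEXT_HOPS[zlib.crc32(key.encode()) % len(NEXT_HOPS)]

# All packets of one QP follow one path, avoiding reordering within an RDMA operation.
print(ecmp_next_hop("10.0.0.1", "10.0.1.2", 49152, 4791, dest_qp=0x12AB))
```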

4.2 InfiniBand ECMP

In the control plane, IB routing is handled by the subnet manager, which implements ECMP on the basis of topology discovery. However, because the centralized subnet manager is separate from the network devices, it may not sense topology changes in a timely manner and thus cannot always achieve dynamic load balancing.

In the data plane, IB’s ECMP is similarly based on hash calculation and round-robin mechanisms.

5. RoCE and InfiniBand Technical Comparison

Based on the above analysis, the following table summarizes the main similarities, differences, and strengths of the RoCE and IB protocol stacks.

| | RoCEv2 | Score | InfiniBand | Score |
|---|---|---|---|---|
| Physical Layer | Fiber/Copper, QSFP/OSFP, PAM4, 64/66b | ★★★★★ | Fiber/Copper, QSFP/OSFP, NRZ, 64/66b | ★★★★☆ |
| Link Layer | Ethernet, PFC, ETS | ★★★★☆ | IB link layer, credit-based flow control, SL + VL | ★★★★★ |
| Network Layer | IPv4/IPv6/SRv6, BGP/OSPF | ★★★★★ | IB network layer, Subnet Manager | ★★★★☆ |
| Transport Layer | IB transport layer | ★★★★★ | IB transport layer | ★★★★★ |
| Congestion Control | PFC, ECN, DCQCN | ★★★★☆ | Credit-based flow control, ECN, vendor-specific algorithm | ★★★★☆ |
| QoS | ETS, DSCP | ★★★★★ | SL + VL, Traffic Class | ★★★★☆ |
| ECMP | Hash-based load balancing, round-robin, QP-aware | ★★★★★ | Hash-based load balancing, round-robin | ★★★★★ |

  • Physical Layer: Both RoCE and IB support 800G, but PAM4 has stronger upgrade potential compared to NRZ. Ethernet also has lower costs than IB, giving RoCE an edge.
  • Link Layer: Both achieve lossless transmission. RoCE’s ETS can provide bandwidth guarantees for traffic of different priorities, while IB switches typically have lower forwarding latency.
  • Network Layer: RoCE leverages the mature and continuously developing IP, making it more adaptable to large-scale networks.
  • Transport Layer and above: RoCE and IB use the same protocols, with no difference.
  • Congestion Control: RoCE combines PFC, ECN, and DCQCN to provide an open solution, while IB has a highly integrated solution based on Credit. Both have pros and cons but are inadequate in dealing with large-scale collective communication traffic.
  • QoS: RoCE can guarantee bandwidth for each priority, while IB only provides priority forwarding for high-priority traffic.
  • ECMP: Both implement hash-based load sharing.

In conclusion, RoCE and InfiniBand, both defined by IBTA, don’t have fundamental differences. RoCE essentially ports the mature IB transport layer and RDMA to equally mature Ethernet and IP networks, combining strengths while maintaining high performance and reducing RDMA network costs, making it adaptable to larger-scale networks.

Moreover, while both have strengths in flow control and congestion management, they are inadequate in dealing with high-bandwidth, bursty, and broadcast-type collective communication traffic in large-scale AI training and inference. Thoroughly improving these issues awaits the introduction of next-generation protocols, such as Ultra Ethernet.

World’s Lowest Latency RoCEv2 Ultra Ethernet Switches: Asterfusion CX-N Series

The Asterfusion RoCEv2 Ultra Ethernet Switch is the ideal choice for AI/ML and HPC applications. It supports ports ranging from an impressive 25G to a staggering 800G.

This is the world’s lowest latency RoCEv2 switch, setting a new benchmark in the realm of data transfer.

At the heart of Asterfusion is Marvell’s Teralynx engine. This powerful component, combined with Asterfusion’s enterprise SONiC RoCEv2 stack, delivers an unbeatable solution that is not just efficient but also supremely cost-effective, challenging the monopoly of InfiniBand.

The Marvell Teralynx 7/10 series, a crucial part of this ensemble, is the undisputed speed champion. With an incredible port-to-port latency of less than 400 nanoseconds (Teralynx 7) and 560 nanoseconds (Teralynx 10), it leaves the competition far behind.

But Asterfusion is more than just its parts. It represents the future of connectivity, offering customers an open, dynamic platform that combines the low latency and high bandwidth of Ethernet. This effectively addresses the limitations of InfiniBand’s closed, proprietary hardware. With Asterfusion, we’re not just changing the game. We’re redefining it.

For more test results comparing IB and RoCE, see:

On-site Test Result: Asterfusion RoCEv2-enabled SONiC Switches Vs. InfiniBand

