Advanced AI Technologies Explained: How Lossless Networking and Load Balancing Technologies Power AI Clusters

written by Asterfusion

April 8, 2026

Introduction

This article clarifies how DCB relates to load balancing technologies such as WCMP and packet spray, all of which are advanced AI networking technologies. It does not dive into the internal mechanisms of each technology.

The advanced AI networking technologies covered in this article include DCB (with support for PFC, PFC watchdog, DCBX, and ECN parameters), as well as load balancing technologies such as WCMP, ARS (Adaptive Routing and Switching), and packet spray, along with INT-driven routing.

Lossless Networking in AIDC Scenarios

To explain these technologies, start with lossless networking. A lossless network is designed so that congestion does not cause packet drops: senders are slowed or paused before switch buffers overflow.

As AI clusters scale to tens of thousands of GPUs, maintaining a fully lossless fabric becomes difficult, so the industry is pushing RoCE to evolve toward designs that depend less on strict lossless configuration.

Two concepts are often confused: DCB and RoCE. They can be understood as follows. DCB (Data Center Bridging) enhances Ethernet. It prepares the network for lossless transport. RoCE runs on top of this enhanced Ethernet, like a high-performance workload.

DCB came first.

In the late 2000s, data centers faced a challenge. Servers used Ethernet for data traffic and Fibre Channel for storage. This increased cost and complexity. The goal was to converge both onto Ethernet. Ethernet is best-effort and allows packet loss. Storage traffic, such as FCoE, cannot tolerate loss. IEEE introduced the DCB suite, including PFC and ETS. The goal was to evolve Ethernet into a lossless transport.

RoCE (RDMA over Converged Ethernet) followed.

RoCE v1 emerged around 2010. It brings RDMA to Ethernet. RDMA was originally designed for InfiniBand. RDMA is sensitive to packet loss. Even minor loss can degrade performance. DCB provides the required lossless behavior. This enables RDMA to run over Ethernet.

The relationship between lossless networking and load balancing evolves over time:

Early stage (Tightly coupled): RoCE depends on DCB. Without proper PFC and DCBX configuration, RoCE cannot operate reliably.

Mid stage (Congestion management): DCB alone is not sufficient. PFC can introduce head-of-line blocking and deadlocks. ECN is introduced for congestion signaling. Load balancing mechanisms such as ARS and packet spray are also deployed.

Current stage (Resilient design): The industry is moving toward lossless transport over lossy networks. With advanced load balancing and hardware-assisted retransmission, RoCE is becoming less dependent on strict DCB configuration.

Data Center Bridging in AIDC Networking

In practical deployments, DCB is a suite of IEEE standards that enhances Ethernet to create lossless networks. This allows high-performance storage (SAN) and networking (LAN) traffic to share the same infrastructure.

A lossless network is defined and controlled through parameters such as ECN, PFC, PFC Watchdog, and DCBX. Each plays a specific role within the DCB framework:

  • DCBX (IEEE 802.1Qaz): The first step. Switches and NICs exchange configuration information to ensure both sides agree on lossless mode and unified priority settings.
  • PFC (IEEE 802.1Qbb): The last line of defense. When buffers approach capacity, it pauses traffic for specific priorities to prevent packet loss.
  • PFC Watchdog: Addresses PFC side effects, such as potential deadlocks. If a link remains paused too long, the watchdog forces recovery or drops the queue to avoid network-wide stalls.
  • ECN: A proactive mechanism. It marks packets before buffers reach PFC thresholds, signaling senders to slow down and reducing the frequency of PFC triggers.
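The ordering of ECN and PFC described above can be sketched in a few lines of Python. This is a minimal illustration, not a switch implementation: the `QueueThresholds` class, its field names, and all threshold values are assumptions chosen only to show that ECN marking starts well before the PFC pause threshold is reached.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueThresholds:
    """Hypothetical per-priority buffer thresholds in KB (illustrative values)."""
    ecn_min_kb: int = 100       # below this depth, packets are never marked
    ecn_max_kb: int = 400       # at or above this depth, every packet is marked
    ecn_max_prob: float = 0.2   # marking probability just below ecn_max_kb
    pfc_xoff_kb: int = 800      # depth at which a PFC pause is sent upstream
    pfc_xon_kb: int = 600       # depth at which the pause is lifted


def ecn_mark_probability(depth_kb: float, t: QueueThresholds) -> float:
    """Probability of ECN-marking a packet at the given queue depth."""
    if depth_kb < t.ecn_min_kb:
        return 0.0
    if depth_kb >= t.ecn_max_kb:
        return 1.0
    span = t.ecn_max_kb - t.ecn_min_kb
    return t.ecn_max_prob * (depth_kb - t.ecn_min_kb) / span


def pfc_paused(depth_kb: float, currently_paused: bool, t: QueueThresholds) -> bool:
    """Pause at XOFF, resume at XON, keep the previous state in between."""
    if depth_kb >= t.pfc_xoff_kb:
        return True
    if depth_kb <= t.pfc_xon_kb:
        return False
    return currently_paused


if __name__ == "__main__":
    t = QueueThresholds()
    paused = False
    for depth in (50, 250, 500, 850, 700, 550):
        paused = pfc_paused(depth, paused, t)
        prob = ecn_mark_probability(depth, t)
        print(f"{depth:>4} KB  ecn_p={prob:.2f}  pfc_paused={paused}")
```

In this sketch the queue is already marking every packet by 400 KB, so senders that react to ECN have a chance to slow down before the 800 KB pause threshold is hit. Real deployments tune these values per platform and per link speed.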

Since RoCE networks currently cannot be fully separated from DCB, these parameters also directly affect RoCE performance:

  • Performance coupling: RoCE lacks complex congestion window logic. Misconfigured DCB (e.g., mismatched PFC priorities and RoCE flows) can cause massive packet loss under even slight load, triggering retransmissions and increasing latency from microseconds to milliseconds.
  • Configuration linkage: On Asterfusion switches, optimizing RoCE performance essentially means fine-tuning these DCB parameters.
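The mismatch case mentioned above (PFC priorities that do not line up with RoCE flows) can be illustrated with a small consistency check. This is only a sketch under stated assumptions: `DcbPeerConfig` and `check_roce_lossless` are hypothetical names, and the code compares plain Python sets rather than parsing real DCBX TLVs or any vendor CLI output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DcbPeerConfig:
    """Hypothetical view of one link partner's PFC settings."""
    name: str
    pfc_enabled_priorities: frozenset  # 802.1p priorities (0-7) with PFC on


def check_roce_lossless(roce_priority: int,
                        switch: DcbPeerConfig,
                        nic: DcbPeerConfig) -> list:
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    for peer in (switch, nic):
        if roce_priority not in peer.pfc_enabled_priorities:
            problems.append(
                f"{peer.name}: PFC is not enabled on RoCE priority {roce_priority}"
            )
    if switch.pfc_enabled_priorities != nic.pfc_enabled_priorities:
        problems.append(
            f"PFC priority sets differ: {switch.name}="
            f"{sorted(switch.pfc_enabled_priorities)} vs {nic.name}="
            f"{sorted(nic.pfc_enabled_priorities)}"
        )
    return problems


if __name__ == "__main__":
    switch = DcbPeerConfig("switch", frozenset({3}))
    nic = DcbPeerConfig("nic", frozenset({4}))  # deliberately mismatched
    for problem in check_roce_lossless(3, switch, nic):
        print(problem)
```

Here RoCE traffic is assumed to use priority 3, while the NIC only protects priority 4, so RoCE packets on that link would be treated as ordinary lossy traffic.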

From the above, it is clear how lossless technologies relate to both lossless networks and RoCE deployments.

The next section discusses load balancing in lossless networks.

Key takeaway: Load balancing helps relieve stress on lossless networks.

Without DCB, RoCE performance drops sharply. To improve RoCE performance further, load balancing technologies are used to optimize traffic distribution and keep latency low, which in turn makes the network more resilient.

Load Balancing Technologies Relieve Pressure on Lossless Networks

Effective load balancing spreads traffic across the available links, so that no single path becomes a bottleneck while others sit idle.

Load balancing technologies, such as ARS and Packet Spray, help reduce the load on lossless networks:

  • Congestion reduction: Proper load balancing distributes traffic evenly across all paths, reducing the chance of switch buffer overflow.
  • Lower PFC frequency: Well-chosen paths reduce how often PFC pause frames are triggered. Frequent PFC triggers increase the risk of deadlocks or head-of-line blocking.

In short, load balancing is a tool, while achieving a lossless network remains the goal. Combined, they ensure stable and high-performance RoCE networks.

In RoCE networks, load balancing technologies can be classified into three types:

Flow-based Load Balancing

ECMP is the common example. It hashes flows across multiple equal-cost paths, keeping packets of the same flow on a fixed path to preserve order. ECMP improves link utilization but lacks congestion awareness. Uneven traffic or hash collisions can lead to unbalanced path loads.
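A minimal sketch of flow-based hashing may help make the collision problem concrete. The function names and the use of CRC32 are assumptions for illustration; real switch ASICs use their own hash functions and field selections. The destination port 4791 is the standard RoCEv2 UDP port; the addresses and source ports are made up.

```python
import zlib
from collections import Counter


def ecmp_next_hop(flow: tuple, next_hops: list) -> str:
    """Pick a next hop by hashing the flow's 5-tuple (illustrative hash only)."""
    key = "|".join(str(field) for field in flow).encode()
    return next_hops[zlib.crc32(key) % len(next_hops)]


def load_per_path(flows: list, next_hops: list) -> Counter:
    """Count how many flows land on each next hop."""
    return Counter(ecmp_next_hop(flow, next_hops) for flow in flows)


if __name__ == "__main__":
    spines = ["spine1", "spine2", "spine3", "spine4"]
    # (src_ip, dst_ip, protocol, src_port, dst_port); only src_port varies.
    flows = [("10.0.0.1", "10.0.1.1", 17, 49152 + i, 4791) for i in range(8)]
    for hop, count in sorted(load_per_path(flows, spines).items()):
        print(hop, count)
```

Every packet of a given flow maps to the same next hop, which preserves ordering. With only a handful of large flows, as is common in AI training traffic, nothing prevents several of them from hashing to the same link while another link stays idle.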

WCMP extends ECMP by introducing weights. Different links carry traffic proportionally to capacity, bandwidth, or policy.
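One common way to picture WCMP is a member table in which each next hop appears as many times as its weight, with the flow hash indexing into that table. The sketch below assumes this replicated-table approach and hypothetical function names; it is not a description of any particular switch's implementation.

```python
import zlib


def build_wcmp_table(weights: dict) -> list:
    """Expand {next_hop: weight} into a table with one entry per weight unit."""
    table = []
    for hop, weight in weights.items():
        table.extend([hop] * weight)
    return table


def wcmp_next_hop(flow: tuple, table: list) -> str:
    """Hash the flow into the weighted table (illustrative CRC32 hash)."""
    key = "|".join(str(field) for field in flow).encode()
    return table[zlib.crc32(key) % len(table)]


if __name__ == "__main__":
    # Assumed topology: spine1 is reached over 2x the capacity of spine2.
    table = build_wcmp_table({"spine1": 2, "spine2": 1})
    print(table)
    flow = ("10.0.0.1", "10.0.1.1", 17, 49152, 4791)
    print(wcmp_next_hop(flow, table))
```

With weights of 2 and 1, roughly two thirds of flows are expected to land on spine1 over a large number of flows, while each individual flow still stays on one path.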

Packet-based Load Balancing

A typical example is Packet Spray. It distributes individual packets of the same flow across multiple paths using random or round-robin algorithms.

  • Per-packet distribution: Different packets from the same flow may take different paths, quickly utilizing multiple links.
  • Direct spraying: Packet Spray does not rely on telemetry like INT to detect congestion first; it continuously sprays traffic.
  • Aggressive utilization: Packet Spray increases path usage but may cause more packet reordering, requiring receiver-side reassembly or upper-layer tolerance.
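The reordering cost of per-packet spraying can be shown with a toy model. The sketch below assumes round-robin spraying, one packet sent per microsecond, and fixed per-path delays; the function names and delay values are hypothetical and chosen only for illustration.

```python
from itertools import cycle


def spray(num_packets: int, path_delay_us: dict) -> list:
    """Round-robin packets over paths; return sequence numbers in arrival order."""
    paths = cycle(path_delay_us)
    arrivals = []
    for seq in range(num_packets):
        path = next(paths)
        send_time_us = seq  # assumption: one packet leaves every microsecond
        arrivals.append((send_time_us + path_delay_us[path], seq))
    arrivals.sort()
    return [seq for _, seq in arrivals]


def count_out_of_order(arrival_order: list) -> int:
    """Count packets that arrive after a packet with a higher sequence number."""
    highest_seen = -1
    late = 0
    for seq in arrival_order:
        if seq < highest_seen:
            late += 1
        else:
            highest_seen = seq
    return late


if __name__ == "__main__":
    equal = spray(8, {"path_a": 3, "path_b": 3})
    skewed = spray(8, {"path_a": 1, "path_b": 5})
    print(equal, count_out_of_order(equal))
    print(skewed, count_out_of_order(skewed))
```

When both paths have the same delay, packets arrive in order. As soon as one path is slower, packets sent later on the fast path overtake earlier packets on the slow path, which is why the receiver or the transport has to tolerate reordering.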

Flowlet-based Load Balancing

A typical example is ARS. The minimal unit is a flowlet: a contiguous sequence of packets within a flow, separated from the next sequence by an idle gap.

Flowlets allow adaptive path selection per segment, increasing utilization while minimizing reordering.
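The flowlet idea can be sketched as a small state table keyed by flow. This is an illustration under assumptions, not the ARS implementation: the `FlowletSwitch` class, the gap value, and the path-picking callback are all hypothetical.

```python
from itertools import cycle


class FlowletSwitch:
    """Keeps a flow on its path until an idle gap starts a new flowlet."""

    def __init__(self, gap_us: int, pick_path):
        self.gap_us = gap_us        # idle time that ends a flowlet
        self.pick_path = pick_path  # callable returning the path for a new flowlet
        self._state = {}            # flow_id -> (last_seen_us, path)

    def forward(self, flow_id: str, now_us: int) -> str:
        entry = self._state.get(flow_id)
        if entry is None or now_us - entry[0] > self.gap_us:
            path = self.pick_path()  # new flowlet: free to choose a new path
        else:
            path = entry[1]          # same flowlet: stay on the current path
        self._state[flow_id] = (now_us, path)
        return path


if __name__ == "__main__":
    rotation = cycle(["path_a", "path_b"])
    switch = FlowletSwitch(gap_us=50, pick_path=lambda: next(rotation))
    for t in (0, 10, 20, 200, 210):
        print(t, switch.forward("flow-1", t))
```

Packets at 0, 10, and 20 microseconds stay together on one path. The 180 microsecond pause before the packet at 200 exceeds the assumed 50 microsecond gap, so a new flowlet starts and may take a different path. If the gap is larger than the delay difference between paths, the earlier flowlet has drained before the new one arrives, which is what keeps reordering low.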

Conceptually, adaptive load balancing acts like a traffic controller, continuously monitoring path conditions and guiding traffic along the fastest, least congested routes.

INT-driven Routing

INT (In-band Network Telemetry) embeds per-hop metrics, such as queue occupancy, latency, and path information, within packets. This allows the network to detect congestion in real time.

Asterfusion combines INT with load balancing to implement INT-driven routing, providing feedback for adaptive traffic distribution:

  • Enables Adaptive Load Balancing (ALB) across multiple paths.
  • Driven by real-time link utilization measured via INT.
  • Supports flow-based, flowlet-based, and packet-spray forwarding with WCMP.
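As a rough illustration of how per-hop telemetry could feed path selection, the sketch below scores each path by its most congested hop and derives WCMP-style weights from the remaining headroom. Everything here is an assumption made for illustration: the `HopMetadata` fields, the scoring rule, and the weight formula are hypothetical and do not describe Asterfusion's implementation or the INT wire format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HopMetadata:
    """Hypothetical per-hop record collected in-band along a path."""
    switch_id: str
    queue_occupancy_pct: float
    hop_latency_us: float


def path_congestion(hops: list) -> float:
    """Score a path by its most congested hop (the bottleneck)."""
    return max(hop.queue_occupancy_pct for hop in hops)


def pick_least_congested(paths: dict) -> str:
    """Return the name of the path with the lowest bottleneck occupancy."""
    return min(paths, key=lambda name: path_congestion(paths[name]))


def weights_from_headroom(paths: dict, buckets: int = 10) -> dict:
    """Turn remaining headroom (100 - bottleneck %) into integer path weights."""
    headroom = {name: 100.0 - path_congestion(hops) for name, hops in paths.items()}
    total = sum(headroom.values())
    if total == 0:
        return {name: 1 for name in paths}  # all paths saturated: split evenly
    return {name: max(1, round(buckets * h / total)) for name, h in headroom.items()}


if __name__ == "__main__":
    paths = {
        "via-spine1": [HopMetadata("leaf1", 10.0, 2.0), HopMetadata("spine1", 80.0, 9.0)],
        "via-spine2": [HopMetadata("leaf1", 10.0, 2.0), HopMetadata("spine2", 20.0, 3.0)],
    }
    print(pick_least_congested(paths))
    print(weights_from_headroom(paths))
```

The resulting weights could then drive whichever forwarding granularity is in use, whether that is per flow, per flowlet, or per packet.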

Note: ALB refers to adaptive load balancing in general, while ARS is a vendor-specific implementation (e.g., Marvell-defined).

Load balancing technologies play a crucial role in mitigating network congestion and ensuring optimal resource utilization.

By integrating ARS, Packet Spray, and other load balancing technologies at the device level, traffic is dynamically scheduled based on real-time network state. This improves link utilization, reduces congestion probability, and lowers PFC pressure in lossless networks.

Conclusion

DCB provides near-lossless network capability. Without it, RoCE performance drops sharply. To protect lossless behavior, load balancing technologies are required to intelligently schedule traffic and enhance network performance.

Asterfusion integrates INT with load balancing to enable telemetry-driven adaptive routing, improving congestion avoidance in RoCE fabrics. INT-driven routing further enhances load balancing, supporting lossless network goals in large-scale AI data center scenarios.
