
Fully Adaptive Routing Ethernet using BGP
Technology White Paper

1. Background

As trillion-parameter large language models (LLMs) become mainstream, AI compute clusters are rapidly scaling to tens of thousands or even hundreds of thousands of GPUs. In highly synchronized distributed AI training workloads, such as All-Reduce and All-to-All collective communication, the underlying network must deliver not only massive aggregate bandwidth, but also ultra-low latency, lossless forwarding, and fast, reliable failure convergence.

Traditional Spine-Leaf fabrics built on BGP with ECMP (Equal-Cost Multi-Path) load balancing were designed for conventional traffic patterns. Under today’s high-performance AI workloads, however, their traffic distribution mechanisms are reaching their limits. This creates a strong need for a new approach to link awareness and traffic scheduling at the network layer. BGP FARE (Fully Adaptive Routing Ethernet, an IETF draft) is an emerging technology developed to address these challenges.

1.1 Scheduling Limitations of Static Hash-Based Load Balancing

In traditional large-scale data center networks, multipath forwarding primarily relies on ECMP. Its core mechanism is per-flow load balancing. The switch ASIC extracts packet header fields, such as the source IP, destination IP, source port, destination port, and protocol number, as hash factors. A static hash algorithm generates a hash value from these fields, which is then mapped to one of the ECMP member links using a modulo operation. As a result, packets belonging to the same flow are always forwarded through the same egress interface.
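The per-flow selection described above can be sketched in a few lines of Python. This is only an illustration: real ASICs use vendor-specific hash functions, so SHA-256 is an assumption chosen here because it is deterministic across runs, and the names `FiveTuple` and `ecmp_select` are hypothetical.

```python
import hashlib
from typing import NamedTuple


class FiveTuple(NamedTuple):
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int


def ecmp_select(flow: FiveTuple, member_links: list[str]) -> str:
    """Map a flow to one ECMP member link with a static hash and a modulo."""
    key = "|".join(str(field) for field in flow).encode()
    # SHA-256 is a stand-in for the ASIC hash; it is not what hardware uses.
    hash_value = int.from_bytes(hashlib.sha256(key).digest()[:4], "big")
    return member_links[hash_value % len(member_links)]


links = ["L1", "L2", "L3", "L4"]
flow = FiveTuple("10.0.0.1", "10.0.1.1", 49152, 4791, 17)
first_choice = ecmp_select(flow, links)
# Every later packet of the same flow lands on the same link.
repeat_choices = {ecmp_select(flow, links) for _ in range(1000)}
```

Because the choice depends only on header fields, nothing in this function can react to how busy the selected link already is.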

This static hashing mechanism performs well in traditional Internet and cloud computing environments, where traffic consists of a large number of independent, short-lived flows. At scale, these flows are statistically distributed across available paths, resulting in relatively balanced link utilization. However, under the high-performance communication patterns of AI compute clusters, ECMP exposes several critical limitations:

  • Elephant flow collisions and localized congestion: AI training workloads exchange large volumes of model parameters, producing a small number of extremely high-bandwidth elephant flows. Because static hashing is unaware of the real-time utilization or available bandwidth of each physical link, multiple elephant flows can be mapped to the same link. This creates severe microbursts, leading to buffer overflow and packet loss. At the same time, other ECMP member links may remain underutilized, reducing overall network bandwidth efficiency.
  • Rigid traffic distribution in asymmetric topologies: AI data center networks are typically expanded in phases, with heterogeneous hardware deployed over time. In addition, fiber degradation, optical module failures, or link impairments can reduce link bandwidth or take links offline. These conditions create asymmetric network topologies. Traditional ECMP uses a fixed 1:1 traffic distribution model across equal-cost paths. Even when a link becomes degraded, the ASIC continues forwarding the same proportion of traffic to that path, increasing congestion and the risk of packet loss.

1.2 Lack of Bottleneck Awareness in Asymmetric Networks

In multi-tier Clos fabrics built with hundreds or thousands of switches, bandwidth bottlenecks can occur at any layer of the network, including Leaf-to-Spine and Spine-to-SuperSpine links. Traditional routing protocols advertise only reachability information and abstract routing metrics (Metric/Cost). They provide no mechanism to propagate bandwidth constraints hop by hop across an end-to-end path. This creates two major limitations:

  • No end-to-end bandwidth bottleneck awareness: Edge devices, such as Leaf switches, have no visibility into where the bandwidth bottleneck exists along a multi-hop path. They cannot determine which node or link limits the available bandwidth between the source and destination. Without this end-to-end bottleneck awareness, edge switches cannot make informed path selection or traffic scheduling decisions.
  • Route instability caused by real-time traffic engineering: To address this limitation, the industry has introduced various Traffic Engineering (TE) techniques based on real-time link utilization, queue depth, or forwarding latency. However, AI workloads can change at microsecond timescales. Continuously adjusting routing weights based on instantaneous bandwidth utilization, queue occupancy, or forwarding delay places significant pressure on the control plane. Frequent route recalculation and policy updates consume substantial CPU resources and can trigger route oscillation. The resulting latency fluctuations directly impact distributed AI training, where the slowest node determines the completion time of the entire synchronization operation. This leaves GPU resources idle and reduces overall cluster efficiency.

1.3 Emergence of BGP FARE

To overcome the limitations of traditional static load balancing and avoid route oscillations introduced by dynamic TE approaches, BGP FARE was developed. It abandons complex schemes that rely on real-time metrics such as link utilization or queueing delay for continuous weight tuning. Instead, it adopts a simpler design principle based on physical line-rate capacity, efficient control-plane convergence, and strict data-plane load balancing.

  • WCMP based on physical line-rate capacity: In the control plane, BGP FARE focuses only on the line-rate bandwidth of physical links. Bandwidth constraints are propagated hop by hop across the network, using a strict minimum-value propagation model. This allows edge devices to identify the actual bottleneck node along an end-to-end path. Based on this deterministic bandwidth profile, BGP installs Weighted-Cost Multi-Path (WCMP) forwarding entries with corresponding weights. These weights remain stable and do not change frequently with real-time utilization or queue depth, which eliminates route oscillations in the control plane.
  • Extreme load balancing with packet spraying: In the data plane, BGP FARE can leverage fine-grained Packet Spray techniques. Packets are distributed across all available links at the smallest granularity. This maximizes bandwidth utilization and removes the hash polarization problem inherent in per-flow load balancing.
  • Efficient failure convergence via extended community attributes: BGP FARE extends BGP using Path Bandwidth extended community attributes. Bandwidth awareness, route computation, link capacity updates, and failure convergence are handled in a unified manner. No additional out-of-band probing is required. This provides strong convergence behavior and high scalability.

Through coordination between bandwidth-aware control-plane computation and packet-level load balancing in the data plane, BGP FARE enables a high-throughput, evenly balanced, and fast self-healing fabric for large-scale AI compute clusters and HPC environments.

2. How Does Fully Adaptive Routing Ethernet Using BGP Work?

2.1 Basic Concepts

The following terms are used throughout the BGP FARE architecture:

Term | Definition
WCMP | A multipath forwarding mechanism that assigns different forwarding weights to unequal-cost paths in hardware forwarding tables.
Packet Spray | A hardware forwarding technique that ignores the five-tuple of a flow and distributes consecutive packets across all available member links on a per-packet basis, typically using round-robin or randomized scheduling.
Path Bandwidth | An optional transitive or non-transitive BGP Extended Community attribute used to explicitly advertise the maximum available bandwidth of a route prefix along the current path between BGP peers.

Table 1. BGP FARE Terms and Definitions

2.2 How BGP FARE Processes and Uses Bandwidth Information

2.2.1 Path Bandwidth Propagation Across Multiple Hops

BGP FARE uses the Path Bandwidth extended community attribute to propagate bandwidth constraints hop by hop across the network. As a route to a destination prefix is advertised from the source and traverses multiple BGP nodes, each node updates the Path Bandwidth attribute according to the available forwarding capacity.

The processing rules are as follows:

At the originating node:

Before advertising a route, the originating node sets the Path Bandwidth extended community attribute to the line-rate bandwidth of the outgoing interface used to advertise the route.

At each intermediate node:

When an intermediate node receives advertisements for the same destination prefix from multiple upstream peers, each route carries a Path Bandwidth (PathBW) value representing the bandwidth propagated from the corresponding incoming path.

The node performs the following steps:

  1. Sum the Path Bandwidth values from all received route advertisements: PathBWSum = Σ PathBW
  2. Exclude the incoming interfaces that received the route advertisements. Count the number of eligible outgoing interfaces (N) used to advertise the route.
  3. Calculate the average bandwidth allocated to each outgoing path: PathBWAvg = PathBWSum / N
  4. Before advertising the route on each outgoing interface, update the Path Bandwidth Extended Community attribute with the result of min(PathBWAvg, line-rate bandwidth of that outgoing interface)
      [Figure: Fully adaptive routing Ethernet algorithm]
      1. Bandwidth Aggregation and Distribution at Intermediate Nodes

      Spine 1 receives route advertisements from Leaf 1 with a cumulative path bandwidth of: PathBWSum1 = L1 + L3 = 50.0 + 100.0 = 150.0 Gbps
      After excluding the incoming interface, there are N = 2 eligible outgoing interfaces. The average bandwidth is calculated as: PathBWAvg1 = 150.0 / 2 = 75.0 Gbps
      Spine 2 receives route advertisements from Leaf 1 with a cumulative path bandwidth of: PathBWSum2 = L2 + L4 = 100.0 + 100.0 = 200.0 Gbps
      After excluding the incoming interface, there are N = 2 eligible outgoing interfaces. The average bandwidth is calculated as: PathBWAvg2 = 200.0 / 2 = 100.0 Gbps

       

      2. Bandwidth Constraints Advertised to Downstream Nodes

      The Path Bandwidth attribute advertised on each outgoing link is calculated as follows:

      L5: min(PathBWAvg1, L5) = min(75.0, 100.0) = 75.0 Gbps
      L6: min(PathBWAvg1, L6) = min(75.0, 100.0) = 75.0 Gbps
      L7: min(PathBWAvg2, L7) = min(100.0, 100.0) = 100.0 Gbps
      L8: min(PathBWAvg2, L8) = min(100.0, 100.0) = 100.0 Gbps
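The intermediate-node rule and the Spine 1 / Spine 2 figures above can be reproduced with a short Python sketch. The function name `advertised_path_bw` is hypothetical, and the sketch assumes the caller has already excluded the incoming interfaces from `outgoing_link_bw`.

```python
def advertised_path_bw(received_path_bw, outgoing_link_bw):
    """Return the Path Bandwidth (Gbps) to advertise on each outgoing link.

    received_path_bw: PathBW values carried by the received advertisements.
    outgoing_link_bw: line rate of each eligible outgoing interface, with the
        incoming interfaces already excluded (an assumption of this sketch).
    """
    n = len(outgoing_link_bw)
    if n == 0:
        return {}
    path_bw_avg = sum(received_path_bw) / n  # PathBWSum / N
    # Each outgoing link is capped by its own line rate.
    return {link: min(path_bw_avg, rate) for link, rate in outgoing_link_bw.items()}


# Spine 1 hears the route over L1 (50G) and L3 (100G), then advertises on L5 and L6.
spine1 = advertised_path_bw([50.0, 100.0], {"L5": 100.0, "L6": 100.0})
# Spine 2 hears the route over L2 (100G) and L4 (100G), then advertises on L7 and L8.
spine2 = advertised_path_bw([100.0, 100.0], {"L7": 100.0, "L8": 100.0})
```

With these inputs, `spine1` holds 75.0 Gbps for L5 and L6, and `spine2` holds 100.0 Gbps for L7 and L8, matching the values listed above.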

      Assume a route traverses a node sequence P = {N₁, N₂, …, Nₖ}, where N₁ is the originator and Nₖ is the destination edge node. For any two adjacent nodes Nᵢ → Nᵢ₊₁, the outbound Path Bandwidth is recursively defined as:

      BOutbound(Nᵢ → Nᵢ₊₁) = min(BInbound(Nᵢ₋₁ → Nᵢ), BLocal_Physical_Link(Nᵢ → Nᵢ₊₁))

      Where:

      • BInbound: The averaged available bandwidth derived from all upstream nodes. It is computed as the sum of all previously constrained bandwidth values received from upstream neighbors, divided by the number of local outbound interfaces used to advertise the route.
      • BLocal_Physical_Link: The line-rate bandwidth of the local outgoing physical interface at node Nᵢ.

      With this hop-by-hop minimum-value propagation model, bandwidth constraints are validated and propagated at every node. As the route information reaches the network edge, the bottleneck bandwidth has already been reflected in the Path Bandwidth attribute. This enables every node to maintain a deterministic bandwidth constraint for every reachable destination.

      2.2.2 WCMP Forwarding Weights Based on Path Bandwidth

      After an edge node learns the end-to-end bandwidth of all available paths, the control plane converts the bandwidth information into WCMP forwarding weights for the hardware forwarding table.

      Assume a destination prefix has M reachable next-hop paths. The control plane extracts the bandwidth of each path after hop-by-hop constraint propagation, denoted as B₁, B₂, …, B_M. The forwarding weight Wᵢ assigned to each WCMP member in the ASIC is calculated according to the relative path bandwidth:

      [Equation image: how to calculate weight in FARE]

      Where:

      • Scale_Factor is a scaling factor that converts floating-point weights into simplified integer values by removing the greatest common divisor (GCD), while ensuring that the total number of weighted members does not exceed the ASIC limit for a single ECMP/WCMP group.

      In the simulation example, the WCMP group installed on Leaf 2 assigns forwarding weights to links L5, L6, L7, and L8 in the ratio: 75.0 : 75.0 : 100.0 : 100.0 = 3 : 3 : 4 : 4

      As a result, when the forwarding capacity of the Spine 1 path toward Leaf 1 is reduced, Leaf 2 automatically directs a larger share of traffic to Spine 2. This prevents congestion and packet loss caused by oversubscribing the lower-bandwidth path.
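One way to turn path bandwidths into small integer weights is to divide out the greatest common divisor, as the Scale_Factor description suggests. The sketch below is an assumption about how that could be done, not the exact formula used by BGP FARE: the function name `wcmp_weights`, the default group limit of 64 members, and the proportional fallback are all hypothetical.

```python
from functools import reduce
from math import gcd


def wcmp_weights(path_bw_gbps, max_group_size=64):
    """Convert per-path bandwidths (Gbps) into small integer WCMP weights."""
    # Work in integer Mbps so the GCD is well defined for values such as 75.0.
    scaled = [round(bw * 1000) for bw in path_bw_gbps]
    if not scaled or min(scaled) <= 0:
        raise ValueError("every path needs a positive bandwidth")
    divisor = reduce(gcd, scaled)
    weights = [value // divisor for value in scaled]
    total = sum(weights)
    if total > max_group_size:
        # Coarse fallback: shrink proportionally, keeping every path in the group.
        weights = [max(1, w * max_group_size // total) for w in weights]
    return weights


# Path bandwidths learned by Leaf 2 for L5, L6, L7, and L8.
leaf2_weights = wcmp_weights([75.0, 75.0, 100.0, 100.0])
```

For the example values this yields the 3 : 3 : 4 : 4 ratio given above. The fallback branch trades accuracy for table space, and other quantization choices are equally plausible.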

      2.3 Packet Spray

      When Packet Spray[1][2] is enabled, the ingress forwarding pipeline ignores the packet five-tuple, including the source IP, destination IP, source port, destination port, and protocol number. Load balancing is no longer performed on a per-flow basis. Regardless of whether the traffic consists of a small number of high-bandwidth elephant flows or a large number of short-lived flows, packets are treated as independent forwarding units. Each packet is distributed across the available egress links using a round-robin scheduling algorithm.

      In the ASIC, WCMP is implemented on top of the ECMP forwarding mechanism. Forwarding weights are realized by replicating ECMP members according to their assigned weights. For example, if the WCMP weight ratio is L1:L2:L3:L4 = 1:1:2:2, the ASIC installs an ECMP group with six members: L1, L2, L3, L3, L4, L4.

      When Packet Spray is enabled, the ASIC distributes packets evenly across these six ECMP members. Since L1 and L2 each appear once, while L3 and L4 each appear twice, the long-term traffic distribution converges to the same 1:1:2:2 ratio. As a result, traffic allocation accurately matches the bandwidth capacity of each path.
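The member replication and round-robin spraying described above can be simulated in Python. This is a software model only, assuming strict round-robin over the replicated member list; the helper names `expand_members` and `spray` are hypothetical.

```python
from collections import Counter
from itertools import cycle, islice


def expand_members(weights):
    """Replicate each link according to its WCMP weight."""
    return [link for link, weight in weights.items() for _ in range(weight)]


def spray(members, packet_count):
    """Send packets round-robin over the member list and count packets per link."""
    return Counter(islice(cycle(members), packet_count))


members = expand_members({"L1": 1, "L2": 1, "L3": 2, "L4": 2})
counts = spray(members, 6000)
```

Here `members` is the six-entry group L1, L2, L3, L3, L4, L4, and after 6000 packets the per-link counts are 1000, 1000, 2000, and 2000, which is the 1:1:2:2 ratio.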

      The combination of Packet Spray and WCMP eliminates the hash polarization caused by per-flow load balancing, significantly reducing microburst congestion. It also ensures that traffic is distributed according to the expected bandwidth ratio, maximizing utilization of higher-bandwidth links.

      2.4 Native Fast Failure Convergence

      During long-running, large-scale AI training, hardware failures and link outages are inevitable. BGP FARE is built entirely on the native BGP Path Bandwidth extended community attribute, allowing it to inherit the mature and reliable failure convergence mechanisms of the BGP control plane.

      When an intermediate node detects a link failure, the local BGP process immediately withdraws the affected route. The corresponding bandwidth update is carried directly in the Path Bandwidth extended community attribute of the BGP route advertisement. Upon receiving the update, neighboring nodes detect the change in the Path Bandwidth value, recalculate the available path bandwidth, and advertise the updated value to their peers. In this way, bandwidth changes caused by topology events propagate hop by hop throughout the network.
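The effect of such an update on downstream weights can be sketched by reusing the link values from the Section 2.2 example. The failure scenario here is hypothetical (link L3 between Leaf 1 and Spine 1 goes down), as are the helper names `readvertise` and `to_weights`; the sketch models only the arithmetic, not BGP message handling.

```python
from functools import reduce
from math import gcd


def readvertise(received_path_bw, outgoing_link_bw):
    """Recompute the Path Bandwidth (Gbps) advertised on each outgoing link."""
    path_bw_avg = sum(received_path_bw) / len(outgoing_link_bw)
    return {link: min(path_bw_avg, rate) for link, rate in outgoing_link_bw.items()}


def to_weights(path_bw):
    """Reduce path bandwidths to integer WCMP weights by dividing out the GCD."""
    scaled = {link: round(bw * 1000) for link, bw in path_bw.items()}
    divisor = reduce(gcd, scaled.values())
    return {link: value // divisor for link, value in scaled.items()}


spine1_out = {"L5": 100.0, "L6": 100.0}
spine2_adv = readvertise([100.0, 100.0], {"L7": 100.0, "L8": 100.0})

# Before the failure, Spine 1 hears the route over L1 (50G) and L3 (100G).
before = to_weights({**readvertise([50.0, 100.0], spine1_out), **spine2_adv})
# After L3 fails, only the 50G advertisement over L1 remains at Spine 1.
after = to_weights({**readvertise([50.0], spine1_out), **spine2_adv})
```

In this scenario Leaf 2 moves from weights 3 : 3 : 4 : 4 to 1 : 1 : 4 : 4, shifting traffic toward Spine 2 without any input other than the re-advertised Path Bandwidth values.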

      This mechanism requires no external traffic engineering controller or out-of-band link monitoring system. Because bandwidth updates are distributed through native BGP route propagation, BGP FARE provides a short convergence path, fast failure recovery, and excellent scalability for large-scale AI fabrics.

      3. Typical Deployment Scenarios

      With deterministic line-rate bandwidth awareness, packet-level load balancing in the data plane, and a simple, stable control plane, BGP FARE addresses the key performance bottlenecks of modern high-throughput networks.

      3.1 Large-Scale AI Factory

      In AI training clusters with tens of thousands or even hundreds of thousands of GPUs, BGP FARE generates WCMP forwarding weights solely from the physical line-rate bandwidth of each path. It does not rely on complex real-time traffic telemetry or dynamically adjust WCMP weights based on instantaneous network conditions, thereby avoiding route oscillations caused by frequent control-plane updates.

      When a link degrades or fails, BGP FARE leverages native BGP route withdrawal together with the Path Bandwidth extended community attribute to recalculate and propagate updated bandwidth information. This enables millisecond-level convergence without requiring an external traffic engineering controller.

      [Figure: Link failure convergence]

      When GPU1 on Server1 communicates with GPU1 on Server65, a link failure occurs between Spine4 and Leaf9. This failure reduces the aggregated available bandwidth for all routes from Spine4 to Server65 GPU1.

      As a result, Spine4 recalculates the available path bandwidth and advertises updated route information with reduced bandwidth values to Leaf1. Upon receiving the update, Leaf1 detects the decrease in available path bandwidth and updates its WCMP forwarding weights accordingly.

      The updated WCMP weights reduce the amount of traffic forwarded to Spine4. This prevents congestion and buffer buildup on Spine4, avoiding packet loss that would otherwise occur under traditional ECMP-based equal-cost traffic distribution.

      3.2 High-Performance Computing (HPC) Clusters

      HPC clusters often evolve through phased construction and multi-generation hardware upgrades. In addition, physical link asymmetry caused by partial upgrades or link speed degradation is common. As a result, multiple paths with different line rates coexist in the network.

      [Figure: Heterogeneous capacity network topology]

      In the example topology, Spine1–Spine4 provide 100G per-port bandwidth, while Spine5–Spine8 provide 200G per-port bandwidth. When GPU1 on Server1 communicates with GPU8 on Server128, a traditional ECMP approach distributes traffic evenly across Spine1 to Spine8. This leads to congestion and potential packet loss on lower-speed devices such as Spine1, while higher-speed devices such as Spine5 remain underutilized.

      With BGP FARE, the hop-by-hop bandwidth constraint propagation allows Leaf1 to accurately learn the real physical bottlenecks across multiple paths. WCMP weights are then computed automatically based on path capacity. Traffic is distributed such that approximately 33% is forwarded through Spine1–Spine4, and 67% through Spine5–Spine8.
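The 33% / 67% split follows directly from the port speeds. The short Python check below assumes that the per-port rate of each Spine is the bottleneck on its path, which is a simplification of the example topology; `traffic_share` is a hypothetical helper.

```python
def traffic_share(path_bw):
    """Return each path's fraction of traffic, proportional to its bandwidth."""
    total = sum(path_bw.values())
    return {path: bw / total for path, bw in path_bw.items()}


# Spine1-Spine4 offer 100G per port, Spine5-Spine8 offer 200G per port.
spine_port_gbps = {f"Spine{i}": 100.0 for i in range(1, 5)}
spine_port_gbps.update({f"Spine{i}": 200.0 for i in range(5, 9)})

shares = traffic_share(spine_port_gbps)
low_speed_share = sum(shares[f"Spine{i}"] for i in range(1, 5))   # 400 / 1200
high_speed_share = sum(shares[f"Spine{i}"] for i in range(5, 9))  # 800 / 1200
```

Under these assumptions each 100G Spine carries 1/12 of the traffic instead of the 1/8 it would receive with equal-cost ECMP, and each 200G Spine carries 1/6.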

      This capacity-aware traffic distribution improves utilization efficiency and ensures stable operation in heterogeneous bandwidth environments.

      [1] Packet Spray is supported only on CX864E-N. On other platforms, a hash-enhanced load balancing mechanism can be used as an alternative.
      [2] Packet Spray requires NICs with hardware-based out-of-order reassembly capability at the endpoint. For NICs that do not support this feature, Packet Spray can be disabled.
