The AI Data Center Revolution: Can Ultra Ethernet Unlock 90%+ Network Utilization?
written by Asterfusion
1. Why Is It Difficult for Traditional Ethernet to Break the Utilization Bottleneck?
Network utilization is a key metric in data communication networks, defined as the ratio of actual transmitted traffic to available bandwidth. In AI data center networks it directly impacts business-critical performance indicators such as Job Completion Time (JCT) and inference throughput, which makes it especially important to maximize.
Traditional Ethernet typically achieves a utilization rate of only around 35% to 40%, a limitation that stems from several factors:
- Traffic Diversity: Traffic is random and unpredictable, with flows of varying sizes, rates, and durations coexisting. Networks must be provisioned for peak load, so utilization drops during off-peak periods.
- Blocking in Network Design: The traditional multi-tier access-aggregation-core architecture applies convergence (oversubscription) ratios between layers, so either access links sit idle or aggregation links become congested, preventing the network from reaching its full performance.
- Transmission Losses: During congestion, packets at the tail of queues are dropped, necessitating end-to-end retransmissions that further waste bandwidth.
- Weak Flow-Control Awareness: Flow-control mechanisms have little visibility into actual network load and cannot adjust traffic in real time, so bandwidth sits underutilized during startup and adjustment phases.
- Unbalanced Network Load: Flow-based Equal-Cost Multi-Path (ECMP) routing operates at a coarse granularity, and traditional routing protocols compute optimal paths from static bandwidth data. As a result, some links are overloaded while others sit idle (a simple illustration follows this list).
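To make the last point concrete, here is a minimal sketch (Python, with a generic hash, a hypothetical link count, and made-up flow sizes) of how flow-based ECMP can load equal-cost uplinks very unevenly when a few large flows dominate:

```python
import hashlib
from collections import defaultdict

NUM_LINKS = 4  # hypothetical number of equal-cost uplinks

def ecmp_link(five_tuple):
    """Flow-based ECMP: hash the IP 5-tuple to pick one uplink."""
    digest = hashlib.sha256(str(five_tuple).encode()).digest()
    return digest[0] % NUM_LINKS

# Hypothetical traffic mix: a few large "elephant" flows plus many small ones.
flows = [
    (("10.0.0.1", "10.0.1.1", 17, 49152, 4791), 400_000_000),  # RDMA-like, 400 MB
    (("10.0.0.2", "10.0.1.2", 17, 49153, 4791), 400_000_000),
    (("10.0.0.3", "10.0.1.3", 17, 49154, 4791), 400_000_000),
] + [(("10.0.2.%d" % i, "10.0.3.1", 6, 50000 + i, 80), 1_000_000) for i in range(30)]

bytes_per_link = defaultdict(int)
for five_tuple, size in flows:
    bytes_per_link[ecmp_link(five_tuple)] += size

for link in range(NUM_LINKS):
    print(f"link {link}: {bytes_per_link[link] / 1e6:.0f} MB")
# If two of the large flows hash onto the same uplink, that link is overloaded
# while another equal-cost link carries almost nothing.
```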
2. Ultra Ethernet Aims to Boost Network Utilization for AI Data Centers
The Ultra Ethernet Consortium (UEC) has set the goal of raising network utilization in AI data center networks above 85% [1].

2.1 Dedicated Networks
When the AI model, parallelism strategy, and input data are fixed, the traffic patterns for AI training and inference become predictable [2]. By building dedicated networks for AI training and inference traffic—fully isolated from other traffic—we lay the foundation for higher network utilization.
2.2 Non-Blocking Topologies
Next, we need a non-blocking network architecture such as CLOS, Dragonfly, Torus, MegaFly, or SlimFly. Currently, CLOS is the most popular topology [3], where total access bandwidth equals total aggregation bandwidth, enabling easy scaling both vertically and horizontally while achieving macroscopic non-blocking behavior. However, due to unbalanced traffic distribution and micro-bursts, localized congestion can still occur on specific links.
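As a minimal sketch of the 1:1 condition behind a non-blocking leaf-spine fabric (the port counts and speeds below are hypothetical examples, not a specific product configuration):

```python
# Minimal non-blocking check for a two-tier leaf-spine (CLOS) fabric.
# All port counts and speeds are hypothetical examples.

def is_non_blocking(server_ports, server_speed_gbps, uplink_ports, uplink_speed_gbps):
    """A leaf is non-blocking when its uplink capacity covers its downlink capacity."""
    downlink = server_ports * server_speed_gbps
    uplink = uplink_ports * uplink_speed_gbps
    return uplink >= downlink, downlink, uplink

ok, down, up = is_non_blocking(server_ports=32, server_speed_gbps=400,
                               uplink_ports=32, uplink_speed_gbps=400)
print(f"downlink {down} Gbps, uplink {up} Gbps, non-blocking: {ok}")
# With a 2:1 convergence ratio (e.g. 48 x 400G down, 24 x 400G up),
# the same check fails, which is the blocking design described in section 1.
```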
2.3 Lossless Transmission
In traditional Ethernet, congestion is managed by dropping packets using mechanisms like tail-drop or Weighted Random Early Detection (WRED) to address transient bandwidth oversubscription. This approach is undeniably crude. To address this, InfiniBand and RoCE protocols employ Explicit Congestion Notification (ECN) and link-level flow control techniques, such as InfiniBand’s Credit-Based Flow Control (CBFC) and RoCE’s Priority-Based Flow Control (PFC). These technologies effectively handle point-to-point burst congestion and are widely adopted in data center networks.
However, AI training and inference generate collective communication patterns (e.g., Broadcast, All-Reduce, All-Gather, All-to-All), leading to In-Cast congestion, where multiple sources simultaneously send traffic to a single destination and overload the destination link. Addressing this requires new techniques. The UEC proposes In-Network Computing (INC), which leverages switch computation to reduce the amount of traffic transmitted over the network, mitigating In-Cast congestion. However, INC is primarily effective for All-Reduce operations, where partial results can be merged during transmission; its impact on other collective communication patterns is limited, so In-Cast congestion still demands more advanced solutions.
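To make the traffic-reduction argument concrete, here is a back-of-the-envelope comparison (Python, with hypothetical worker counts and buffer sizes): when N senders each push a gradient buffer of size S toward one destination, the last-hop link receives N x S bytes without INC, but only about S bytes if the switch sums the buffers in flight.

```python
# Back-of-the-envelope In-Cast volume with and without In-Network Computing (INC).
# Worker count and buffer size are hypothetical.

N = 8                      # workers all sending toward the same reduction target
S = 256 * 1024 * 1024      # gradient buffer per worker: 256 MiB

without_inc = N * S        # every buffer crosses the destination's last-hop link
with_inc = S               # the switch pre-reduces them into a single buffer

print(f"last-hop bytes without INC: {without_inc / 2**20:.0f} MiB")
print(f"last-hop bytes with INC:    {with_inc / 2**20:.0f} MiB")
# All-Reduce benefits because partial sums can be merged in flight;
# patterns such as All-to-All cannot be compressed this way.
```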
2.4 UEC Congestion Control
When In-Cast congestion occurs, current solutions rely on end-to-end flow control. For example, ECN-based schemes such as DCQCN and DCTCP adjust the source's sending rate to match available bandwidth. However, since ECN carries only 1 bit of information, this adjustment is imprecise: DCQCN drastically reduces the rate upon receiving an ECN signal, then gradually increases it until it matches the available bandwidth, a slow process during which utilization remains suboptimal. To address this, the Ultra Ethernet Transport (UET) introduces the following improvements:
- Accelerated Adjustment: UET adjusts sending rates based on end-to-end latency measurements and the receiver's capacity, enabling rapid convergence to line rate.
- Telemetry-Based Feedback: Congestion information collected from the network identifies the location and cause of congestion, shortening signaling paths and giving endpoints richer information for faster responses (a toy comparison of the two approaches follows).
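The difference in convergence behavior can be sketched with a toy model (Python; the constants and update rules below are illustrative assumptions, not the actual DCQCN or UET algorithms): a 1-bit ECN loop halves its rate on a congestion mark and recovers additively, while a loop fed with latency and receiver-capacity estimates can jump most of the way to the measured available bandwidth in a single step.

```python
# Toy rate-control comparison. The update rules and constants are illustrative
# assumptions, not the real DCQCN or UET algorithms.

LINE_RATE = 400.0          # Gbps actually available to this flow

def ecn_step(rate, ecn_marked, additive=5.0):
    """1-bit feedback: halve on a mark, then creep back up additively."""
    return rate * 0.5 if ecn_marked else min(LINE_RATE, rate + additive)

def telemetry_step(rate, measured_available):
    """Richer feedback (delay / receiver capacity): move most of the way
    toward the measured available bandwidth each step."""
    return rate + 0.8 * (measured_available - rate)

ecn_rate = tel_rate = 50.0                 # both start well below line rate
for _ in range(20):
    ecn_rate = ecn_step(ecn_rate, ecn_marked=False)
    tel_rate = telemetry_step(tel_rate, measured_available=LINE_RATE)

print(f"after 20 steps: ECN-style {ecn_rate:.0f} Gbps, telemetry-style {tel_rate:.0f} Gbps")
# The additive-increase loop is still climbing toward 400 Gbps,
# while the telemetry-driven loop has essentially converged.
```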
2.5 Packet Spraying
While UEC’s flow control techniques address congestion between source and destination, load imbalance persists across the network—some links are overutilized while others remain underutilized. Traditionally, ECMP is used for network-wide load balancing, distributing flows (defined by IP 5-tuple) across equal-cost paths via hashing. However, when flow sizes vary, traffic distribution becomes unbalanced. In AI training and inference, flows have low entropy, long durations, and high volumes, leading to “polarization”, where flows concentrate on certain paths, leaving others idle.
To tackle this, UEC introduces packet spraying, which evenly distributes individual packets across multiple equal-cost paths to maximize bandwidth utilization. This approach causes out-of-order packet arrival at the destination, requiring protocol modifications to handle reordering and reassemble packets into complete messages. However, reassembly introduces overhead, increasing latency and preventing pipelined processing, as the destination must wait for all packets of a message to arrive.
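A simplified sketch of the mechanism (Python; the path count, message size, and latency model are hypothetical): sequence-numbered packets are sprayed round-robin across equal-cost paths, arrive out of order, and are held in a reorder buffer until they can be delivered in sequence, which is the reassembly overhead described above.

```python
import random

NUM_PATHS = 4        # hypothetical number of equal-cost paths
NUM_PACKETS = 12     # hypothetical message split into 12 packets

# Sender: spray sequence-numbered packets round-robin across the paths.
sprayed = [(seq, seq % NUM_PATHS) for seq in range(NUM_PACKETS)]

# Each path has a different (random) latency, so arrivals are out of order.
path_latency = {p: random.uniform(1.0, 5.0) for p in range(NUM_PATHS)}
arrivals = sorted(sprayed, key=lambda pkt: path_latency[pkt[1]] + 0.1 * pkt[0])

# Receiver: hold packets in a reorder buffer until the next expected
# sequence number arrives, then deliver in order.
expected, buffer, delivered = 0, {}, []
for seq, _path in arrivals:
    buffer[seq] = True
    while expected in buffer:
        delivered.append(expected)
        del buffer[expected]
        expected += 1

print("arrival order:  ", [seq for seq, _ in arrivals])
print("delivered order:", delivered)
```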
3. Asterfusion Unlocks the Full Potential of Ultra Ethernet to Maximize AI Data Center Efficiency
The UEC has charted a path to improve network utilization with Ultra Ethernet, but as an evolving standard, its implementation depends on vendor innovation. As an early UEC member, Asterfusion has developed cutting-edge technologies to push network utilization to 90% and beyond.
3.1 Flowlet
Flow-based ECMP often leads to load imbalance, while packet spraying introduces latency overhead. Is there a middle ground? Enter the flowlet. A flow is split into flowlets at sufficiently long "idle" time gaps: within a flowlet, packets are temporally contiguous, while the gaps between flowlets are far larger than the inter-packet gaps within a flowlet, allowing different flowlets to traverse different paths without reordering issues.

In parallel computing, computation and communication alternate, making AI training and inference traffic inherently flowlet-based.
During congestion, flowlets can be rerouted to less busy links. In AI data center (AIDC) networks, RDMA flows are persistent (training flows last minutes to hours, inference flows seconds to minutes), while flowlets are short bursts (microseconds to milliseconds). Fine-grained flowlet scheduling therefore optimizes traffic distribution, reducing congestion and boosting utilization.
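A minimal flowlet-detection sketch (Python; the idle-gap threshold and timestamps are hypothetical): packets whose inter-arrival gap exceeds the threshold start a new flowlet, and each new flowlet can safely be steered onto a different path.

```python
# Minimal flowlet detection: a gap larger than the idle threshold starts a new flowlet.
# The threshold and timestamps are hypothetical illustrations.

IDLE_GAP_US = 50.0   # must exceed the worst-case delay difference between paths

def split_into_flowlets(arrival_times_us, idle_gap_us=IDLE_GAP_US):
    flowlets, current = [], [arrival_times_us[0]]
    for prev, now in zip(arrival_times_us, arrival_times_us[1:]):
        if now - prev > idle_gap_us:
            flowlets.append(current)   # the idle gap closes the current flowlet
            current = []
        current.append(now)
    flowlets.append(current)
    return flowlets

# Two communication bursts separated by a computation phase.
timestamps_us = [0, 2, 4, 6, 8, 500, 502, 504, 506]
for i, flowlet in enumerate(split_into_flowlets(timestamps_us)):
    print(f"flowlet {i}: {len(flowlet)} packets")
# Because the gap between bursts (about 492 us here) exceeds any path-delay
# difference, the second flowlet can take a different, less loaded path
# without arriving ahead of the first.
```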
3.2 INT-Based Routing
Routing protocols traditionally calculate optimal paths using static network data (e.g., OSPF uses bandwidth, BGP uses AS-PATH length). This disconnects routing from real-time load. Asterfusion’s INT-based routing integrates OSPF, BGP, and In-Network Telemetry (INT) to compute multiple paths between node pairs, with path costs based on dynamically measured latency. This enables real-time load-aware routing, fully utilizing available bandwidth.
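A hedged sketch of the idea (Python; the path names, latency values, and cost function are hypothetical illustrations, not the exact Asterfusion implementation): each candidate path's cost is refreshed from telemetry, so forwarding prefers whichever path is currently fastest instead of tie-breaking on static metrics.

```python
# Sketch of load-aware path selection: the static protocol cost is combined with
# per-path latency measured by in-band telemetry. All values are hypothetical.

candidate_paths = {
    "via-spine-1": {"static_cost": 10, "int_latency_us": 12.0},
    "via-spine-2": {"static_cost": 10, "int_latency_us": 55.0},  # currently congested
    "via-spine-3": {"static_cost": 10, "int_latency_us": 14.0},
}

def dynamic_cost(path):
    # Equal static costs no longer decide the outcome; measured latency does.
    return path["static_cost"] * path["int_latency_us"]

best = min(candidate_paths, key=lambda name: dynamic_cost(candidate_paths[name]))
print("next flowlet steered to:", best)

# When fresh telemetry arrives, costs are recomputed and traffic naturally
# shifts away from paths whose latency has started to climb.
candidate_paths["via-spine-1"]["int_latency_us"] = 80.0
best = min(candidate_paths, key=lambda name: dynamic_cost(candidate_paths[name]))
print("after telemetry refresh, steered to:", best)
```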

3.3 WCMP
ECMP evenly distributes packets, flowlets, or flows across paths, ignoring actual path loads.
Asterfusion’s Weighted Cost Multi-Path (WCMP) algorithm uses telemetry-derived latency data to allocate more traffic to lower-latency paths and less to higher-latency ones, ensuring fair utilization. Ideally, total latency across paths equalizes, maximizing bandwidth usage.
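A minimal sketch of latency-weighted splitting (Python; the latencies and the inverse-latency heuristic are illustrative assumptions, not the exact Asterfusion algorithm): each path receives a share of traffic proportional to the inverse of its measured latency, so faster paths carry more until delays even out.

```python
# WCMP-style split: traffic shares inversely proportional to measured path latency.
# Latency values and the inverse-latency weighting are illustrative assumptions.

measured_latency_us = {"path-A": 10.0, "path-B": 20.0, "path-C": 40.0}

inverse = {path: 1.0 / lat for path, lat in measured_latency_us.items()}
total = sum(inverse.values())
weights = {path: inv / total for path, inv in inverse.items()}

for path, share in weights.items():
    print(f"{path}: {share:.0%} of flowlets")
# path-A (lowest latency) carries the largest share; as its latency rises,
# its weight shrinks and the split drifts toward equal delay on all paths.
```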
4. Conclusion
Asterfusion’s CX864E and other Ultra Ethernet switches leverage Flowlet, INT-Based Routing, and WCMP to elevate AI training and inference network utilization above 90%. This accelerates AI workloads while reducing data center construction and operational costs.
References
[1] Ultra Ethernet Consortium, "Ultra Ethernet Introduction," 15 October 2024.
[2] Asterfusion, “Unveiling AI Data Center Network Traffic.”
[3] Asterfusion, “What is Leaf-Spine Architecture and How to Build it?”