The Ultimate In-Depth Exploration of Ultra Ethernet Consortium (UEC) Technology
written by Asterfusion
On October 15, 2024, at the 2024 OCP Global Summit, the Ultra Ethernet Consortium (UEC) presented its latest progress – the preview of UEC Specification 1.0.
Since its establishment in July 2023, UEC has grown rapidly with over 90 members, including major industry vendors. Its goal is to develop an open, interoperable, and high-performance communication protocol stack based on Ethernet to meet the growing network demands of large-scale AI and HPC deployments.
UEC aims to achieve three ambitious goals:
- As performant as a supercomputing interconnect
- As ubiquitous and cost-effective as Ethernet
- As scalable as a cloud data center
Ultra Ethernet V1.0 technical targets for GPU/TPU Scale-Out networks
- Support for AI training/inference clusters of up to 1 million GPUs/TPUs
- Round-trip time under 10μs
- Single interface bandwidth of 800Gbps and beyond
- Network utilization exceeding 85%
Current Challenges in AI and HPC Networking
AI and HPC networks typically rely on RoCE or InfiniBand protocol stacks. However, as AI and HPC traffic experience explosive growth, these protocols have revealed several critical limitations:
- In-Order Delivery Requirements: Packets must be delivered in sequence, and any out-of-order packet triggers retransmission.
- Flow-Based Load Balancing Constraints: Load balancing operates on a per-flow basis, so packets within the same flow cannot utilize multiple paths. This is particularly problematic for AI networks dominated by “elephant flows,” and it leaves available paths underutilized.
- Go-Back-N Inefficiency: Within RDMA operations, if a single packet is dropped, the entire portion of the RDMA message following that packet must be retransmitted.
- DCQCN Implementation Challenges: The DCQCN congestion control mechanism responds slowly to network congestion, is complex to deploy network-wide, and requires extensive tuning.
UEC (Ultra-Ethernet Consortium) technology addresses these fundamental challenges, offering a next-generation solution for high-performance networking demands.
1 Ultra-Ethernet System
The Ultra-Ethernet system consists of clusters built from nodes and fabric infrastructure. Nodes connect to the network through Fabric Interfaces (network cards), which can host multiple logical Fabric End Points (FEPs). The network comprises multiple planes, each containing interconnected FEPs typically linked via switches.
Key system contexts:
– CCC (Congestion Control Context): Manages network congestion
– PDC (Packet Delivery Context): Handles end-to-end packet delivery
– PASID (Process Address Space ID): Identifies the address space of a specific process participating in a job
The fabric supports two concurrent operational modes:
1. Parallel Job Mode
– Supports MPI, xCCL, and SHMEM workloads
– Features run-to-completion operation
– Enables simultaneous multi-node communication
2. Client/Server Mode
– Ideal for storage operations
– Server continuously services multiple clients
– Communication occurs between node pairs
2 UEC Protocol Stack Overview
UEC redefines Ethernet and introduces a next-generation transport protocol specifically designed for AI/HPC applications. The complete protocol stack is illustrated as follows:
In the diagram, blue indicates mandatory features and yellow indicates optional features.
- The physical layer is fully compatible with traditional Ethernet and optionally supports Forward Error Correction (FEC) statistics.
- The link layer optionally supports Link Level Retry (LLR), and adds packet header compression and enhanced LLDP for feature negotiation.
- The network layer uses the standard IP protocol (unchanged).
- The transport layer is completely redesigned as the core of the UEC protocol stack, featuring:
– Packet Delivery sublayer: Implements next-gen congestion control and flexible packet ordering, etc.
– Message Semantics sublayer: Supports xCCL and MPI messages, etc.
– Optional security transport features
– In-Network Collective (INC) implementation
- The software API layer provides an extended libfabric 2.0 interface.
3 Physical Layer
Physical Layer is fully compatible with IEEE 802.3 standard Ethernet, supports 100Gbps and 200Gbps per channel, and is scalable to 800Gbps and higher port speeds.
The UEC offers optional physical layer performance monitoring capabilities based on FEC (Forward Error Correction) codewords. These metrics operate independently of traffic patterns and link utilization rates. Performance calculations utilize FEC error counter data to determine UCR (Uncorrectable Codeword Rate) and MTBPE (Mean Time Between Packet Errors). These metrics provide crucial insights into physical layer transmission performance and reliability, supporting upper-layer telemetry and congestion control mechanisms.
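As an illustration only, the sketch below derives UCR and an MTBPE estimate from hypothetical FEC counter deltas collected over a polling interval. The counter names and formulas here are simplifying assumptions, not the normative definitions in the UEC specification.

```python
# Minimal sketch: deriving UCR and MTBPE-style metrics from FEC counter deltas.
# Counter names and formulas are illustrative assumptions, not the UEC-normative ones.

def fec_link_metrics(corrected_cw, uncorrectable_cw, total_cw,
                     packets_with_errors, interval_seconds):
    """All arguments are counter deltas accumulated over one polling interval."""
    # Uncorrectable Codeword Rate: fraction of codewords FEC could not repair.
    ucr = uncorrectable_cw / total_cw if total_cw else 0.0

    # Mean Time Between Packet Errors: average seconds between packets that
    # were damaged despite FEC (infinite if no errored packets were seen).
    mtbpe = (interval_seconds / packets_with_errors
             if packets_with_errors else float("inf"))

    return {"ucr": ucr, "mtbpe_seconds": mtbpe,
            "corrected_codewords": corrected_cw}

# Example: a 10-second poll on a busy link.
print(fec_link_metrics(corrected_cw=1_200_000, uncorrectable_cw=2,
                       total_cw=9_800_000_000, packets_with_errors=2,
                       interval_seconds=10.0))
```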
The Reconciliation Sublayer (RS) has been modified to accommodate new UEC link layer capabilities, ensuring seamless integration of enhanced features.
4 Link Layer
4.1 Link Level Retry
The UEC link layer introduces LLR (Link Level Retry) protocol as its major advancement, enabling lossless transmission without relying on Priority Flow Control (PFC).
The LLR mechanism is frame-based. Each frame is assigned a sequence number. When the receiver gets a frame, it checks that the sequence number matches the expected value and returns an acknowledgement (ACK) if it does, or a negative acknowledgement (NACK) if the frame is out of order or a frame appears to have been lost. The sender also maintains a retransmission timer, so frames are still replayed even if a NACK itself is lost.
Traditional replay mechanisms operate at the transport layer. UEC moves retransmission down to the link layer, which delivers the following benefits when link-level transmission errors occur:
- Faster error recovery
- Elimination of unnecessary end-to-end retransmissions
- Reduced tail latency
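To make the mechanism concrete, here is a minimal conceptual sketch of the sequence-number/ACK/NACK exchange described above. The frame handling, replay-buffer policy, and control messages are illustrative assumptions, not the actual UEC link-layer wire protocol.

```python
# Conceptual sketch of a link-level retry (LLR) sender and receiver.
# Frame layout, buffer policy, and timer handling are illustrative assumptions.

class LlrSender:
    def __init__(self):
        self.next_seq = 0
        self.replay_buffer = {}              # seq -> frame, kept until ACKed

    def send(self, frame, tx):
        seq = self.next_seq
        self.replay_buffer[seq] = frame
        self.next_seq += 1
        tx(seq, frame)                       # hand the frame to the wire

    def on_ack(self, seq):
        self.replay_buffer.pop(seq, None)    # frame confirmed, free the slot

    def on_nack(self, seq, tx):
        # Receiver saw a gap: replay everything from the missing frame onward.
        for s in sorted(k for k in self.replay_buffer if k >= seq):
            tx(s, self.replay_buffer[s])

class LlrReceiver:
    def __init__(self):
        self.expected_seq = 0

    def on_frame(self, seq, frame):
        if seq == self.expected_seq:
            self.expected_seq += 1
            return ("ACK", seq)              # in order: deliver and acknowledge
        if seq < self.expected_seq:
            return ("ACK", seq)              # duplicate of an already-delivered frame
        return ("NACK", self.expected_seq)   # gap detected: ask for the missing frame

# Tiny demonstration: frame 1 is lost on the wire, triggering a NACK.
wire = []
sender, receiver = LlrSender(), LlrReceiver()
for payload in ("A", "B", "C"):
    sender.send(payload, lambda seq, frame: wire.append((seq, frame)))
for seq, frame in wire:
    if seq == 1:
        continue                             # simulate a lost frame
    print(receiver.on_frame(seq, frame))     # ('ACK', 0) then ('NACK', 1)
```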
4.2 Packet Rate Improvement
HPC and AI front-end applications often involve “mouse flows” that require higher packet rates, so small-packet performance must be optimized. PRI (Packet Rate Improvement) compresses the Ethernet and IP headers to improve packet rates, addressing inefficiencies in the current protocol stack caused by legacy features and redundant protocol fields.
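The back-of-the-envelope calculation below shows why shaving header bytes matters for small packets at high line rates. The assumed 20-byte saving is purely illustrative and is not the compression gain defined by PRI.

```python
# Back-of-the-envelope packet-rate gain from header compression on small packets.
# The 20-byte saving below is an illustrative assumption, not the PRI-defined value.

def packets_per_second(line_rate_bps, payload_bytes, header_bytes):
    # 20 bytes of preamble + inter-frame gap and 4 bytes of FCS accompany every frame.
    wire_bytes = payload_bytes + header_bytes + 20 + 4
    return line_rate_bps / (wire_bytes * 8)

line_rate = 800e9                 # 800 Gb/s interface
payload = 64                      # small "mouse flow" payload
full_hdrs = 14 + 20 + 8           # Ethernet + IPv4 + UDP headers
compressed = full_hdrs - 20       # assume compression saves ~20 bytes per packet

before = packets_per_second(line_rate, payload, full_hdrs)
after = packets_per_second(line_rate, payload, compressed)
print(f"{before/1e6:.0f} Mpps -> {after/1e6:.0f} Mpps "
      f"({(after/before - 1)*100:.0f}% more packets per second)")
```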
4.3 Link Negotiation Protocol
UEC extends the LLDP protocol with negotiation capabilities to detect optional feature support at link endpoints and enable supported features (like LLR and PRI) when both sides are compatible.
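Conceptually, the negotiation enables an optional feature only when both link partners advertise support for it, roughly as in the sketch below; the feature names and set-intersection logic are simplified assumptions rather than the actual LLDP TLV exchange.

```python
# Simplified model of link-feature negotiation: an optional capability is enabled
# only when both ends of the link advertise support for it.
LOCAL_FEATURES = {"LLR", "PRI"}          # what this port supports (illustrative)

def negotiate(local, peer_advertised):
    # Enable exactly the features both link partners can handle.
    return local & peer_advertised

print(negotiate(LOCAL_FEATURES, {"LLR"}))   # -> {'LLR'}: PRI stays disabled
```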
5 Transport Layer
The traditional RDMA network transport layer (including both InfiniBand and RoCE) has the following drawbacks, which UEC's transport addresses as summarized below:
| Classic RDMA | UEC |
| --- | --- |
| Flow-level multipathing | Packet-level multipathing |
| In-order packet delivery | Out-of-order delivery, in-order message completion |
| Go-back-N (inefficient) | Selective retransmit |
| DCQCN (hard to tune) | Configuration-free congestion control |
| Challenges with scale | Scales to 1,000,000 nodes |
These limitations make network determinism and predictability increasingly difficult to maintain as AI/HPC clusters grow, and they call for a fundamentally new approach.
The UEC Transport Layer (UET) runs on top of the IP and UDP protocols with the goal of solving the above problems, increasing network utilization, reducing forwarding latency, and supporting up to one million nodes.
5.1 Selective Retransmit
Traditional reliable transports enforce strict transmission ordering, and RoCE in particular relies on a go-back-N mechanism. In RDMA communications, where messages typically consist of multiple packets, an error in a single packet triggers retransmission of all subsequent packets. This amplifies occasional transmission errors and exacerbates network congestion. UEC addresses this through selective retransmission, in which only the erroneous or lost packets are retransmitted.
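A toy loss pattern makes the difference obvious: go-back-N replays everything from the first gap onward, while selective retransmission replays only the missing packets. The sketch below is illustrative, not UET's actual retransmission logic.

```python
# Toy comparison of retransmission cost: go-back-N vs. selective retransmit.
def go_back_n_retransmits(sent, received):
    first_gap = min(set(sent) - set(received))
    return [seq for seq in sent if seq >= first_gap]   # replay everything after the gap

def selective_retransmits(sent, received):
    return sorted(set(sent) - set(received))           # replay only what is missing

sent = list(range(100))                 # a 100-packet RDMA message
received = [s for s in sent if s != 3]  # a single packet (seq 3) was dropped

print(len(go_back_n_retransmits(sent, received)))   # 97 packets resent
print(len(selective_retransmits(sent, received)))   # 1 packet resent
```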
5.2 Out-Of-Order Delivery
UET supports both ordered and unordered transmission. This flexibility is crucial in modern networks where multiple paths exist. When packets from the same flow traverse different routes, it can result in out-of-order delivery. Strict ordering requirements would prevent effective load balancing across multiple paths.
UET Transmission Modes:
• ROD (Reliable Ordered Delivery) – Congestion control required; ordered, reliable delivery with retransmission of lost packets
• RUD (Reliable Unordered Delivery) – Congestion control required; unordered, reliable delivery with retransmission of lost packets
• RUDI (RUD for Idempotent Operations) – Congestion control optional; unordered, reliable delivery for idempotent operations that can tolerate duplicate packets
• UUD (Unreliable Unordered Delivery) – Congestion control optional; unordered, unreliable delivery with no retransmission
Unordered transmission requires larger packet buffers at the receiver to reassemble out-of-order packets into complete RDMA messages.
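A minimal sketch of what this receiver-side buffering implies, assuming a simple index-per-packet framing: packets arrive in any order, are parked in a buffer, and the message completes only once every packet has been seen, which is the essence of out-of-order packet delivery with in-order message completion.

```python
# Sketch of receiver-side reassembly for unordered (RUD-style) delivery:
# packets may arrive in any order, but the message completes only when all are present.
class MessageReassembler:
    def __init__(self, total_packets):
        self.total = total_packets
        self.buffer = {}                       # packet_index -> payload

    def on_packet(self, index, payload):
        self.buffer[index] = payload           # duplicates simply overwrite
        return len(self.buffer) == self.total  # True once the message is complete

    def message(self):
        # Reassemble in index order regardless of arrival order.
        return b"".join(self.buffer[i] for i in range(self.total))

r = MessageReassembler(total_packets=4)
for idx in (2, 0, 3, 1):                       # packets arrive out of order
    done = r.on_packet(idx, f"chunk{idx}".encode())
print(done, r.message())                       # True b'chunk0chunk1chunk2chunk3'
```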
5.3 Packet Spraying
Packet spraying enables packet-based multipath transmission. Traditional protocols, which don’t support unordered delivery, must transmit an entire data flow along a single path to avoid out-of-order delivery and retransmission. However, AI/HPC applications often generate “elephant flows” – high-volume, long-duration data streams. Multipath transmission for these flows significantly improves overall network utilization.
With RUD support, UET can distribute packets from the same flow across multiple paths simultaneously through packet spraying. This leverages switches’ ECMP and WCMP (Weighted Cost Multi-Pathing) routing capabilities, enabling packet distribution across multiple paths to the same destination, substantially improving network utilization.
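One common way to realize per-packet multipathing over standard ECMP hardware is to vary an entropy field, such as the UDP source port, from packet to packet so that the switches' hash spreads a single flow across all equal-cost paths. The sketch below assumes that approach with illustrative values; the actual entropy handling in UET is defined by the specification.

```python
# Sketch of packet spraying: vary the UDP source-port entropy per packet so that
# ECMP hashing in the fabric spreads one flow's packets across equal-cost paths.
import zlib

NUM_PATHS = 4                                   # equal-cost paths toward the destination

def entropy_for(packet_seq, base_port=0xC000):
    # Rotate the source port per packet; the real entropy scheme is spec-defined.
    return base_port + (packet_seq % 256)

def ecmp_path(src_port, dst_ip="10.0.0.2", dst_port=9999):   # illustrative 5-tuple fields
    # Model the switch's hash over the header fields that vary between packets.
    key = f"{src_port}:{dst_ip}:{dst_port}".encode()
    return zlib.crc32(key) % NUM_PATHS

paths = [ecmp_path(entropy_for(seq)) for seq in range(16)]
print(paths)   # a single flow's packets land on several different paths
```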
5.4 Congestion Control
UET congestion control includes the following key features:
• Incast Management
Addresses fan-in issues on downlinks during collective communications. AI and HPC applications frequently use collective communication for multi-node synchronization. Incast congestion occurs when multiple senders simultaneously transmit to a single receiver.
• Accelerated Rate Adjustment
Unlike traditional congestion control algorithms that require extended periods for rate adjustment, UET can rapidly reach line rate. This is accomplished through end-to-end latency measurements that drive transmission-rate adjustment, combined with receiver-capability-based sender rate modification (a simplified sketch appears at the end of this subsection).
• Telemetry-Based Control
Network-sourced congestion information provides precise congestion location and cause identification, shortened congestion signaling paths, and enhanced endpoint information for faster response times.
• Adaptive Routing via Packet Spraying
Implements dynamic traffic rerouting through packet spraying to bypass congestion points during network stress.
These features, implemented through coordinated endpoint and switch operations, effectively minimize tail latency.
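As a rough illustration of the latency-driven rate adjustment described above, the sketch below ramps a sender's rate while measured round-trip time stays near an uncongested baseline and backs off multiplicatively when queueing delay builds. The constants and update rule are illustrative assumptions, not the UEC congestion control algorithm.

```python
# Illustrative delay-based rate adjustment (not the UEC algorithm): increase the
# sending rate while measured RTT stays near the uncongested baseline, back off
# multiplicatively when queueing delay builds up.
LINE_RATE_GBPS = 800.0
BASE_RTT_US = 4.0          # assumed uncongested fabric round-trip time
TARGET_QUEUE_US = 2.0      # tolerated extra delay before backing off

def adjust_rate(current_gbps, measured_rtt_us):
    queueing_delay = measured_rtt_us - BASE_RTT_US
    if queueing_delay <= TARGET_QUEUE_US:
        # Little congestion observed: ramp quickly toward line rate.
        return min(LINE_RATE_GBPS, current_gbps * 1.25)
    # Delay is building: cut rate in proportion to how far we overshot the target.
    backoff = TARGET_QUEUE_US / queueing_delay
    return max(1.0, current_gbps * backoff)

rate = 100.0
for rtt in (4.5, 5.0, 5.5, 12.0, 6.0, 5.0):    # a sample of measured RTTs (µs)
    rate = adjust_rate(rate, rtt)
    print(f"rtt={rtt:5.1f}us -> rate={rate:7.1f} Gbps")
```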
5.5 UEC Security
UEC incorporates security at the transport layer with a job-based approach, enabling encryption for all traffic within a job. It efficiently leverages IPSec and PSP (Packet Security Protocol) capabilities to minimize encryption overhead and supports hardware offloading.
6 Software Layer
UEC provides streamlined APIs that simplify RDMA operations. It includes specialized APIs for AI and HPC applications, supporting frameworks such as xCCL, MPI, PGAS, and OpenSHMEM.
7 UEC Switch
To implement the UEC protocol stack, network switches require upgrades. UEC defines a switch architecture that builds upon the SONiC switch framework.
The architecture features:
– Control Plane: Supports INC and SDN controllers
– Data Plane: Enhanced SAI (Switch Abstraction Interface) APIs to leverage hardware-based INC capabilities.
8 UEC End Point
The UEC Fabric End Point incorporates both hardware and software components: in hardware, network interface cards (NICs) with UEC functionality; in the OS kernel, enhanced NIC drivers; and in user space, an extended libfabric implementation with INC management that supports xCCL/MPI/SHMEM applications.
9 UEC and RDMA Comparison
The following table compares the UEC and traditional RDMA stacks:
| Requirement | UEC | Legacy RDMA | UEC Advantage |
| --- | --- | --- | --- |
| Link-level low latency | Link Level Retry, Packet Rate Improvement | Priority-based Flow Control | Lower tail latency |
| Multi-pathing | Packet spraying | Flow-level multi-pathing | Higher network utilization |
| Flexible ordering | Out-of-order packet delivery with in-order message delivery | N/A | Matches application requirements, lower tail latency |
| AI and HPC congestion control | Workload-optimized, configuration-free, lower latency, programmable | DCQCN: configuration required, brittle, signaling requires an additional round trip | Incast reduction, faster response, future-proofing |
| In-Network Collective | Built-in | None | Faster collective operations, lower latency |
| Simplified RDMA | Streamlined API, native workload interaction, minimal endpoint state | Based on IBTA Verbs | App-level performance, lower-cost implementation |
| Security | Scalable, first-class citizen | Not addressed, external to spec | High scale, modern security |
| Large scale with stability and reliability | Targeting 1M endpoints | Typically a few thousand simultaneous endpoints | Current and future-proof scale |
UEC demonstrates significant improvements compared to traditional RDMA networks:
Physical Layer
– Full Ethernet compatibility
– Enhanced network visibility through comprehensive statistics
– Enables high-speed, cost-effective network visualization
Link Layer
– Improved retransmission mechanisms
– Ethernet and IP header compression
– Reduced tail latency
Transport Layer
– Revolutionary congestion control for AI/HPC workloads
– Flexible packet ordering
– Packet spraying technology for improved network utilization
End-to-End Features
– Complete INC implementation
– Reduced Incast congestion
– Faster collective communications with lower latency
– Streamlined RDMA software APIs for enhanced AI/HPC application performance
In conclusion, UEC reimagines data center Ethernet networking, offering a comprehensive alternative to traditional RDMA networks. It enables stable, reliable AI/HPC clusters scaling to millions of nodes while delivering superior performance at lower costs.
Asterfusion CX-N Series: UEC-Ready Data Center Switch for AI & HPC
In March 2024, Asterfusion joined the Ultra Ethernet Consortium (UEC) as a full member. Asterfusion is at the forefront of developing the next-generation communication stack architecture, ensuring Ethernet's continued relevance and efficiency in the AI era.
As the Ultra Ethernet Consortium (UEC) advances its work to improve Ethernet for AI workloads, Asterfusion is building products that are ready for that future. The Asterfusion CX-N AI data center switch portfolio is the definitive choice for AI networks, leveraging standards-based Ethernet systems to provide a comprehensive range of intelligent features, including dynamic load balancing, congestion control, and reliable packet delivery to all RoCE-enabled network adapters. As soon as the UEC specification is finalized, the Asterfusion AI platform will be upgradeable to comply with it.