
The Ultimate In-Depth Exploration of Ultra Ethernet Consortium (UEC) Technology

written by Asterfusion

October 24, 2024

On October 15, 2024, at the 2024 OCP Global Summit, the Ultra Ethernet Consortium (UEC) presented its latest progress – the preview of UEC Specification 1.0.

Since its establishment in July 2023, UEC has grown rapidly with over 90 members, including major industry vendors. Its goal is to develop an open, interoperable, and high-performance communication protocol stack based on Ethernet to meet the growing network demands of large-scale AI and HPC deployments.

UEC aims to achieve three ambitious goals:

  • As performant as a supercomputing interconnect
  • As ubiquitous and cost-effective as Ethernet
  • As scalable as a cloud data center

Ultra Ethernet V1.0 technical targets for GPU/TPU Scale-Out networks

  • Support for AI training/inference clusters of up to 1 million GPUs/TPUs
  • Round-trip time under 10μs
  • Single interface bandwidth of 800Gbps and beyond
  • Network utilization exceeding 85%
Figure 1 Ultra Ethernet targets the Scale-out Network[i]

Current Challenges in AI and HPC Networking

AI and HPC networks typically rely on RoCE or InfiniBand protocol stacks. However, as AI and HPC traffic experience explosive growth, these protocols have revealed several critical limitations:

  • In-Order Delivery Requirements: Packets must be delivered in sequence, with any out-of-order packets triggering retransmission.
  • Flow-Based Load Balancing Constraints: Load balancing operates on a per-flow basis, preventing individual packets within the same flow from utilizing multiple paths. This is particularly problematic for AI networks with numerous “elephant flows,” resulting in underutilization of available paths.
  • Go-Back-N Inefficiency: Within RDMA operations, if a single packet is dropped, the entire portion of the RDMA message following that packet must be retransmitted.
  • DCQCN Implementation Challenges: The DCQCN congestion control mechanism faces several obstacles: slow response to network congestion, complex network-wide deployment, and extensive tuning requirements.

UEC technology addresses these fundamental challenges, offering a next-generation solution for high-performance networking demands.

1 Ultra Ethernet System

The Ultra Ethernet system consists of clusters built from nodes and fabric infrastructure. Nodes connect to the network through Fabric Interfaces (network cards), which can host multiple logical Fabric End Points (FEPs). The network comprises multiple planes, each containing interconnected FEPs typically linked via switches.

Figure 2 Ultra Ethernet System

Key system contexts:

– CCC (Congestion Control Context): Manages network congestion

– PDC (Packet Delivery Context): Handles end-to-end packet delivery

– PASID (Process Address Space ID): Identifies a process's address space within a job

The fabric supports two concurrent operational modes:

1. Parallel Job Mode

   – Supports MPI, xCCL, and SHMEM workloads

   – Features run-to-completion operation

   – Enables simultaneous multi-node communication

2. Client/Server Mode

   – Ideal for storage operations

   – Server continuously services multiple clients

   – Communication occurs between node pairs

2 UEC Protocol Stack Overview

UEC redefines Ethernet and introduces a next-generation transport protocol specifically designed for AI/HPC applications. The complete protocol stack is illustrated as follows:

Figure 3 Ultra Ethernet Protocol Stack

Blue indicates mandatory features; yellow indicates optional features.

  • The physical layer is fully compatible with traditional Ethernet and optionally supports Forward Error Correction (FEC) statistics.
  • The link layer optionally supports Link Level Retry (LLR) and packet header compression, along with enhanced LLDP for feature negotiation.
  • The network layer uses the standard IP protocol (unchanged).
  • The transport layer is completely redesigned as the core of the UEC protocol stack, featuring:

 – Packet Delivery sublayer: Implements next-gen congestion control and flexible packet ordering, etc.

 – Message Semantics sublayer: Supports xCCL and MPI messages, etc.

 – Optional security transport features

 – In-Network Collective (INC) implementation

  • Software API Layer provides an extended Libfabrics 2.0 interface.

3 Physical Layer

The physical layer is fully compatible with IEEE 802.3 standard Ethernet, supports 100Gbps and 200Gbps per lane, and scales to 800Gbps and higher port speeds.

The UEC offers optional physical layer performance monitoring capabilities based on FEC (Forward Error Correction) codewords. These metrics operate independently of traffic patterns and link utilization rates. Performance calculations utilize FEC error counter data to determine UCR (Uncorrectable Codeword Rate) and MTBPE (Mean Time Between Packet Errors). These metrics provide crucial insights into physical layer transmission performance and reliability, supporting upper-layer telemetry and congestion control mechanisms.
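As a rough illustration of how such counter-based metrics can be derived (the exact formulas here are assumptions for the sketch, not taken from the UEC specification), UCR can be computed as the fraction of FEC codewords the decoder could not correct, and MTBPE estimated from that rate and the link's codeword throughput:

```python
# Illustrative FEC-based link health metrics. The formulas are assumed
# for this sketch: UCR as uncorrectable/total codewords, and MTBPE as
# the expected time between packet-corrupting events, assuming one
# damaged packet per uncorrectable codeword.

def uncorrectable_codeword_rate(uncorrectable: int, total: int) -> float:
    """Fraction of FEC codewords the decoder could not correct."""
    return uncorrectable / total if total else 0.0

def mean_time_between_packet_errors(ucr: float,
                                    codewords_per_second: float) -> float:
    """Expected seconds between packet errors on this link."""
    errors_per_second = ucr * codewords_per_second
    return float("inf") if errors_per_second == 0 else 1.0 / errors_per_second

# Example: 3 uncorrectable codewords out of 1e12 observed,
# on a link carrying 1e9 codewords per second.
ucr = uncorrectable_codeword_rate(3, 10**12)
mtbpe = mean_time_between_packet_errors(ucr, 1e9)  # about 333 seconds
```

Because both counters are maintained by the FEC decoder itself, the resulting metrics stay meaningful regardless of traffic patterns or link utilization.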

The Reconciliation Sublayer (RS) has been modified to accommodate new UEC link layer capabilities, ensuring seamless integration of enhanced features.

4 Link Layer

4.1 Link Level Retry (LLR)

The UEC link layer introduces the LLR (Link Level Retry) protocol as its major advancement, enabling lossless transmission without relying on Priority Flow Control (PFC).

The LLR mechanism is frame-based. Each frame is assigned a sequence number; when a frame arrives, the receiver checks that its sequence number matches the expected value and returns an acknowledgement (ACK) if it does, or a negative acknowledgement (NACK) if frames are found to be out of order or lost. The sender also maintains a timeout so that retransmission occurs even if a NACK itself is lost.

Traditional replay mechanisms operate at the transport layer. UEC moves replay down to the link layer, which delivers the following benefits when link-layer transmission errors occur:

  • Faster error recovery
  • Elimination of unnecessary end-to-end retransmissions
  • Reduced tail latency
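The frame-based retry described above can be sketched as follows. This is a minimal illustrative model (class names, the replay buffer, and the ACK/NACK strings are all assumptions of the sketch, not UEC specification details):

```python
# Minimal sketch of link-level retry (LLR): frames carry sequence
# numbers, the receiver ACKs in-sequence frames and NACKs gaps, and the
# sender retransmits from a replay buffer. Illustrative only.

class LLRReceiver:
    def __init__(self):
        self.expected_seq = 0

    def receive(self, seq: int) -> str:
        if seq == self.expected_seq:
            self.expected_seq += 1
            return "ACK"   # frame arrived in order
        return "NACK"      # gap detected: out-of-order or lost frame

class LLRSender:
    def __init__(self, receiver: LLRReceiver):
        self.receiver = receiver
        self.replay_buffer = {}  # seq -> frame, held until ACKed

    def send(self, seq: int, frame: bytes, drop: bool = False) -> None:
        self.replay_buffer[seq] = frame
        if drop:
            return  # simulate a link error: the frame never arrives
        if self.receiver.receive(seq) == "ACK":
            del self.replay_buffer[seq]

    def retransmit_pending(self) -> None:
        # Triggered by a NACK, or by the sender-side timeout that
        # covers the case where the NACK itself is lost.
        for seq in sorted(self.replay_buffer):
            if self.receiver.receive(seq) == "ACK":
                del self.replay_buffer[seq]

rx = LLRReceiver()
tx = LLRSender(rx)
tx.send(0, b"f0")              # delivered and ACKed
tx.send(1, b"f1", drop=True)   # lost on the wire
tx.send(2, b"f2")              # NACKed: sequence gap at the receiver
tx.retransmit_pending()        # replay recovers frames 1 and 2 in order
```

Because recovery happens hop-by-hop at the link layer, the transport layer never sees the loss, which is where the tail-latency benefit comes from.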

4.2 Packet Rate Improvement

HPC and AI front-end applications often involve “mouse flows” that need higher packet rates to optimize small-packet performance. PRI (Packet Rate Improvement) compresses Ethernet and IP headers to raise packet rates, addressing inefficiencies in the current protocol stack caused by legacy features and redundant protocol fields.

UEC extends the LLDP protocol with negotiation capabilities to detect optional feature support at link endpoints and enable supported features (like LLR and PRI) when both sides are compatible.
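Conceptually, this negotiation reduces to enabling only the features both link partners advertise. A minimal sketch (the feature names and sets below are illustrative, not the actual LLDP TLV encoding):

```python
# Illustrative capability negotiation in the spirit of the extended
# LLDP exchange: each endpoint advertises its optional features, and a
# feature is enabled only when both sides support it.

LOCAL_FEATURES = {"LLR", "PRI"}   # hypothetical local capability set
PEER_FEATURES = {"LLR"}           # hypothetical peer advertisement

def negotiate(local: set, peer: set) -> set:
    """Enable exactly the optional features both link partners support."""
    return local & peer

enabled = negotiate(LOCAL_FEATURES, PEER_FEATURES)
# Here only LLR is enabled; PRI stays off because the peer lacks it.
```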

5 Transport Layer

The traditional RDMA network transport layer (including IB and RoCE) has the following drawbacks:

| Classic RDMA | UEC |
| --- | --- |
| Flow-level multipathing | Packet-level multipathing |
| In-order packet delivery | Out-of-order delivery, in-order message completion |
| Go-back-n (inefficient) | Selective retransmit |
| DCQCN (hard to tune) | Configuration-free congestion control |
| Challenges with scale | Scales to 1,000,000 nodes |

These drawbacks make network determinism and predictability increasingly difficult to achieve as AI/HPC clusters grow, and solving them requires a fundamentally new approach.

The UEC Transport Layer (UET) runs on top of the IP and UDP protocols with the goal of solving the above problems, increasing network utilization, reducing forwarding latency, and supporting up to one million nodes.

5.1 Selective Retransmit

Traditional transmission protocols like TCP require strict transmission ordering and utilize a go-back-n mechanism. In RDMA communications, where messages typically consist of multiple packets, an error in a single packet triggers retransmission of all subsequent packets. This amplifies occasional transmission errors and exacerbates network congestion. UEC addresses this through selective retransmission, where only the erroneous packets are retransmitted.
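The cost difference is easy to quantify with a simple model (illustrative counting only; it ignores timers and window effects):

```python
# Retransmission cost for a single lost packet within one multi-packet
# message: go-back-n resends the lost packet and everything after it,
# while selective retransmit resends only the lost packet.

def go_back_n_retransmissions(message_len: int, lost_index: int) -> int:
    """Packets resent under go-back-n (lost packet and all successors)."""
    return message_len - lost_index

def selective_retransmissions(message_len: int, lost_index: int) -> int:
    """Packets resent under selective retransmit (the lost one only)."""
    return 1

# A 1000-packet RDMA message losing packet 10:
# go-back-n resends 990 packets; selective retransmit resends 1.
```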

5.2 Out-Of-Order Delivery

UET supports both ordered and unordered transmission. This flexibility is crucial in modern networks where multiple paths exist. When packets from the same flow traverse different routes, it can result in out-of-order delivery. Strict ordering requirements would prevent effective load balancing across multiple paths.

UET Transmission Modes:

• ROD (Reliable Ordered Delivery) – Requires congestion control; ordered, reliable delivery with retransmission of lost packets

• RUD (Reliable Unordered Delivery) – Requires congestion control; unordered, reliable delivery with retransmission of lost packets

• RUDI (RUD for Idempotent Operations) – Optional congestion control; unordered, reliable delivery restricted to idempotent operations, so duplicate delivery is tolerated

• UUD (Unreliable Unordered Delivery) – Optional congestion control; unordered, unreliable delivery with no retransmission

Unordered transmission requires larger packet buffers at the receiver to reassemble out-of-order packets into complete RDMA messages.
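The reassembly cost described above can be sketched as follows. This is an illustrative model (the class and its interface are assumptions of the sketch, not a UET data structure):

```python
# Receiver-side reassembly for unordered delivery: packets of one
# message may arrive in any order, are parked in a buffer (the extra
# memory cost noted above), and the message completes only once every
# fragment is present.

from typing import Optional

class MessageReassembler:
    def __init__(self, total_packets: int):
        self.total = total_packets
        self.buffer: dict[int, bytes] = {}  # packet index -> payload

    def on_packet(self, index: int, payload: bytes) -> Optional[bytes]:
        self.buffer[index] = payload
        if len(self.buffer) == self.total:
            # All fragments present: complete the message in order.
            return b"".join(self.buffer[i] for i in range(self.total))
        return None  # still waiting on missing fragments

r = MessageReassembler(3)
assert r.on_packet(2, b"C") is None    # arrives out of order, buffered
assert r.on_packet(0, b"A") is None
assert r.on_packet(1, b"B") == b"ABC"  # message completes in order
```

Note how the application still observes in-order message completion even though individual packets arrived out of order, which is exactly the contract RUD provides.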

5.3 Packet Spraying

Packet spraying enables packet-based multipath transmission. Traditional protocols, which don’t support unordered delivery, must transmit an entire data flow along a single path to avoid out-of-order delivery and retransmission. However, AI/HPC applications often generate “elephant flows” – high-volume, long-duration data streams. Multipath transmission for these flows significantly improves overall network utilization.

With RUD support, UET can distribute packets from the same flow across multiple paths simultaneously through packet spraying. This leverages switches’ ECMP and WCMP (Weighted Cost Multi-Pathing) routing capabilities, enabling packet distribution across multiple paths to the same destination, substantially improving network utilization.
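The difference between per-flow ECMP and packet spraying comes down to what feeds the path-selection hash. A toy sketch (path count and hash construction are illustrative, not how any particular switch ASIC implements it):

```python
# Flow-based ECMP hashes only the 5-tuple, so every packet of a flow is
# pinned to one path. Packet spraying also varies the selection per
# packet (here via the sequence number), spreading one flow over many
# paths. Illustrative model only.

import hashlib

NUM_PATHS = 4

def flow_path(five_tuple: tuple) -> int:
    """Classic ECMP: every packet of a flow hashes to the same path."""
    digest = hashlib.sha256(repr(five_tuple).encode()).digest()
    return digest[0] % NUM_PATHS

def sprayed_path(five_tuple: tuple, packet_seq: int) -> int:
    """Packet spraying: the sequence number varies the path per packet."""
    digest = hashlib.sha256(repr((five_tuple, packet_seq)).encode()).digest()
    return digest[0] % NUM_PATHS

flow = ("10.0.0.1", "10.0.0.2", 4791, 55555, "UDP")
ecmp_paths = {flow_path(flow) for _ in range(100)}             # one path
spray_paths = {sprayed_path(flow, s) for s in range(100)}      # many paths
```

For an elephant flow, the ECMP variant saturates a single path while others sit idle; the sprayed variant uses the whole path set, which is the utilization gain claimed above.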

5.4 Congestion Control

UET congestion control includes the following key features:

Incast Management

Addresses fan-in issues on downlinks during collective communications. AI and HPC applications frequently use collective communication for multi-node synchronization. Incast congestion occurs when multiple senders simultaneously transmit to a single receiver.

Figure 4 Incast Congestion

Accelerated Rate Adjustment

Unlike traditional congestion control algorithms that require extended periods for rate adjustment, UET can rapidly achieve line-rate speeds. This is accomplished through end-to-end latency measurements for transmission rate adjustment and receiver capability-based sender rate modification.
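The idea of latency-driven rate adjustment can be illustrated with a toy controller (this is an assumed model for intuition only, not the UET algorithm; the rate, target, and back-off rule are all invented for the sketch):

```python
# Toy delay-based sender rate control: compare measured RTT to a
# target, back off in proportion to the overshoot when latency builds,
# and ramp quickly toward line rate otherwise. Illustrative only.

LINE_RATE_GBPS = 800.0   # assumed interface speed
TARGET_RTT_US = 10.0     # assumed latency target

def adjust_rate(current_gbps: float, measured_rtt_us: float) -> float:
    if measured_rtt_us > TARGET_RTT_US:
        # Queueing detected: scale the rate down by the RTT overshoot.
        return current_gbps * TARGET_RTT_US / measured_rtt_us
    # No queueing observed: ramp aggressively toward line rate.
    return min(LINE_RATE_GBPS, current_gbps * 2.0)

rate = adjust_rate(100.0, 8.0)    # under target: doubles to 200.0
rate = adjust_rate(rate, 20.0)    # over target: halves back to 100.0
```

The multiplicative ramp in the uncongested branch is what lets the sender reach line rate in a handful of RTTs rather than the long additive climb of traditional algorithms.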

Telemetry-Based Control

Network-sourced congestion information provides precise congestion location and cause identification, shortened congestion signaling paths, and enhanced endpoint information for faster response times.

Adaptive Routing via Packet Spraying

Implements dynamic traffic rerouting through packet spraying to bypass congestion points during network stress.

These features, implemented through coordinated endpoint and switch operations, effectively minimize tail latency.

5.5 UEC Security

UEC incorporates security at the transport layer with a job-based approach, enabling encryption for all traffic within a job. It efficiently leverages IPSec and PSP (Packet Security Protocol) capabilities to minimize encryption overhead and supports hardware offloading.

6 Software Layer

UEC provides streamlined APIs that simplify RDMA operations. It includes specialized APIs for AI and HPC applications, supporting frameworks such as xCCL, MPI, PGAS, and OpenSHMEM.

7 UEC Switch

To implement the UEC protocol stack, network switches require upgrades. UEC defines a switch architecture that builds upon the SONiC switch framework.

Figure 6 UEC Switch Architecture

The architecture features:

– Control Plane: Supports INC and SDN controllers

– Data Plane: Enhanced SAI (Switch Abstraction Interface) APIs to leverage hardware-based INC capabilities.

8 UEC End Point

The UEC Fabric End Point incorporates both hardware and software components: in hardware, network interface cards (NICs) with UEC functionality; in the OS kernel, enhanced NIC drivers; and in user space, an extended libfabric implementation with INC management that supports xCCL/MPI/SHMEM applications.

Figure 7 UEC End Point Architecture

9 UEC and RDMA’s Comparison

The following table compares the UEC and traditional RDMA stacks:

| Requirement | UEC | Legacy RDMA | UEC Advantage |
| --- | --- | --- | --- |
| Link Level Low Latency | Link Level Retry, Packet Rate Improvement | Priority-based Flow Control | Lower tail latency |
| Multi-Pathing | Packet spraying | Flow-level multi-pathing | Higher network utilization |
| Flexible Ordering | Out-of-order packet delivery with in-order message delivery | N/A | Matches application requirements, lower tail latency |
| AI and HPC Congestion Control | Workload-optimized, configuration-free, lower latency, programmable | DCQCN: configuration required, brittle, signaling requires an additional round trip | Incast reduction, faster response, future-proofing |
| In-Network Collective | Built-in | None | Faster collective operations, lower latency |
| Simplified RDMA | Streamlined API, native workload interaction, minimal endpoint state | Based on IBTA Verbs | App-level performance, lower-cost implementation |
| Security | Scalable, first-class citizen | Not addressed, external to spec | High scale, modern security |
| Large Scale with Stability and Reliability | Targeting 1M endpoints | Typically a few thousand simultaneous endpoints | Current and future-proof scale |

UEC demonstrates significant improvements compared to traditional RDMA networks:

Physical and Link Layers

– Full Ethernet compatibility

– Enhanced network visibility through comprehensive statistics

– Enables high-speed, cost-effective network visualization

– Improved retransmission mechanisms

– Ethernet and IP header compression

– Reduced tail latency

Transport Layer

– Revolutionary congestion control for AI/HPC workloads

– Flexible packet ordering

– Packet spraying technology for improved network utilization

End-to-End Features

– Complete INC implementation

– Reduced Incast congestion

– Faster collective communications with lower latency

– Streamlined RDMA software APIs for enhanced AI/HPC application performance

In conclusion, UEC reimagines data center Ethernet networking, offering a comprehensive alternative to traditional RDMA networks. It enables stable, reliable AI/HPC clusters scaling to millions of nodes while delivering superior performance at lower costs.

Asterfusion CX-N Series: UEC-Ready Data Center Switches for AI & HPC

In March 2024, Asterfusion joined the Ultra Ethernet Consortium (UEC) as a full member. Asterfusion is at the forefront of developing the next-generation communication stack architecture, ensuring Ethernet's continued relevance and efficiency in the AI era.

As the Ultra Ethernet Consortium (UEC) continues its work to improve Ethernet for AI workloads, Asterfusion is building products that are ready for the future. The Asterfusion CX-N AI data center switch portfolio is the definitive choice for AI networks, leveraging standards-based Ethernet systems to provide a comprehensive range of intelligent features, including dynamic load balancing, congestion control, and reliable packet delivery to all RoCE-enabled network adapters. As soon as the UEC specification is finalized, the Asterfusion AI platform will be upgradeable to comply with it.
