
RoCEv2 AI Solution with NVIDIA DGX SuperPOD

Overview

The NVIDIA DGX SuperPOD™ with NVIDIA DGX™ H100 systems is the cutting-edge data center architecture for AI. Instead of InfiniBand switches, Asterfusion offers an alternative approach based on RoCEv2-enabled 100G-800G Ethernet switches to build a DGX H100 based SuperPOD, covering the compute fabric, storage fabric, and the in-band and out-of-band management networks.

Compute Fabric

For a research project or PoC, starting with a small-scale cluster is a good choice. Up to 4 DGX H100 nodes can be connected to a single 32x400G switch, using QSFP-DD transceivers on the switch ports and OSFP transceivers on the ConnectX-7 NICs of each GPU server, connected with MPO-APC cables.
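As a quick sanity check on the single-switch sizing, the sketch below works out how many DGX H100 nodes one 32x400G switch can host. It is only an illustration of the arithmetic implied above (8 x 400G compute connections per node); it is not a vendor sizing tool.

```python
# Minimal sizing sketch: DGX H100 nodes on a single 32x400G switch.
SWITCH_PORTS_400G = 32   # 32 x 400G QSFP-DD ports on the leaf switch
NICS_PER_NODE = 8        # 8 x 400G ConnectX-7 compute connections per DGX H100

max_nodes = SWITCH_PORTS_400G // NICS_PER_NODE
print(f"Max DGX H100 nodes on one switch: {max_nodes}")  # -> 4
```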

When the number of GPU servers increases to 8, a 2-spine + 4-leaf CLOS fabric can be applied. Each leaf switch connects 2 of the 8 ports of each GPU server, achieving rail-optimized connectivity that increases the efficiency of GPU resources. Spine and leaf switches can be connected through QSFP-DD transceivers with fiber, or with QSFP-DD AOC/DAC/AEC cables.
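Rail-optimized wiring can be expressed as a simple mapping: the NIC port with the same rail index on every server lands on the same leaf, and with 4 leaves each leaf terminates 2 of the 8 rails. The sketch below only illustrates that mapping for the 8-node example; it is not a cabling plan from NVIDIA or Asterfusion.

```python
# Rail-optimized leaf assignment for the 2-spine / 4-leaf, 8-node example.
# Each DGX H100 exposes 8 compute NIC ports (rails 0-7); each of the 4 leaves
# terminates 2 rails from every server, so same-rail ports share a leaf.
NUM_NODES = 8
NUM_RAILS = 8                              # NIC ports per node
NUM_LEAVES = 4
RAILS_PER_LEAF = NUM_RAILS // NUM_LEAVES   # 2

def leaf_for(rail: int) -> int:
    """Leaf switch index that terminates a given rail."""
    return rail // RAILS_PER_LEAF

for node in range(NUM_NODES):
    links = [f"rail{rail}->leaf{leaf_for(rail)}" for rail in range(NUM_RAILS)]
    print(f"dgx{node:02d}: " + ", ".join(links))
```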

A single CX864E-N (a 51.2 Tbps, Marvell Teralynx 10 based 64x800G switch, coming in 2024 Q3) can connect up to 16 GPU servers:

A scalable compute fabric is built from modular blocks, which enables rapid deployment at multiple scales. Each block contains a group of GPU servers and 400G switches (16 x DGX H100 systems with 8 x CX732Q-N in this example). The block is designed to be rail-aligned and non-oversubscribed to ensure the performance of the compute fabric.
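A quick way to see that the block is non-oversubscribed: each of the 8 leaves terminates one rail from each of the 16 servers (16 x 400G down) and presents an equal 16 x 400G of uplink toward the spines. The sketch below restates that arithmetic for a 32-port leaf; it is an illustration, not a vendor-published figure.

```python
# Oversubscription check for one compute block:
# 16 x DGX H100 (8 rails each) on 8 x CX732Q-N (32 x 400G) leaves.
NODES_PER_BLOCK = 16
RAILS_PER_NODE = 8
LEAVES_PER_BLOCK = 8
LEAF_PORTS = 32                # 32 x 400G per leaf

downlinks_per_leaf = NODES_PER_BLOCK * RAILS_PER_NODE // LEAVES_PER_BLOCK  # 16
uplinks_per_leaf = LEAF_PORTS - downlinks_per_leaf                         # 16

ratio = downlinks_per_leaf / uplinks_per_leaf
print(f"downlinks={downlinks_per_leaf}, uplinks={uplinks_per_leaf}, "
      f"oversubscription={ratio:.1f}:1")   # -> 1.0:1, i.e. non-oversubscribed
```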

A typical 64-node compute fabric uses a Leaf-Spine architecture to hold 4 blocks:

To scale the compute fabric further, higher-throughput switches can replace the Spine layer. The CX864E-N can be placed here to expand the network to a maximum of 16 blocks, or 256 nodes:

Table 1 shows the number of switches and cables required for the compute fabric of different scales.

Block Count | Nodes | GPUs  | Spine         | Leaf | Cables
-           | 4     | 32    | -             | 1    | 32
-           | 8     | 64    | 2             | 4    | 128
1           | 16    | 128   | 4             | 8    | 256
2           | 32    | 256   | 8             | 16   | 512
3           | 48    | 384   | 12            | 24   | 768
4           | 64    | 512   | 16            | 32   | 1024
16          | 256   | 2048  | 16 (64x800G)  | 128  | 4096
>16         | >256  | >2048 | Configure according to the actual size, using 64x800G switches throughout
Table 1. Compute fabric component count at different scales
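The counts in Table 1 follow directly from the block geometry above (16 nodes, 8 leaves, and 128 server links per block, with 16 uplinks per leaf). The sketch below reproduces the 1-4 block rows with 32x400G spines and the 16-block row with 64x800G spines; it is a back-of-the-envelope model of the table, not a configuration tool.

```python
import math

# Back-of-the-envelope reproduction of Table 1 (blocks of 16 x DGX H100
# on 8 x 32x400G leaves, 16 x 400G uplinks per leaf).
RAILS_PER_NODE = 8
NODES_PER_BLOCK = 16
LEAVES_PER_BLOCK = 8
UPLINKS_PER_LEAF = 16

def fabric(blocks: int, spine_ports_400g: int) -> dict:
    """Switch and cable counts for a given block count and spine radix
    (32 for a 32x400G spine, 128 for a 64x800G spine in 400G terms)."""
    nodes = blocks * NODES_PER_BLOCK
    leaves = blocks * LEAVES_PER_BLOCK
    uplinks = leaves * UPLINKS_PER_LEAF
    spines = math.ceil(uplinks / spine_ports_400g)
    cables = nodes * RAILS_PER_NODE + uplinks   # server links + leaf-spine links
    return {"nodes": nodes, "gpus": nodes * 8, "spine": spines,
            "leaf": leaves, "cables": cables}

for b in (1, 2, 3, 4):
    print(b, fabric(b, spine_ports_400g=32))    # matches the 400G-spine rows
print(16, fabric(16, spine_ports_400g=128))     # matches the 64x800G-spine row
```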

Storage Fabric

Storage specifications vary across vendors' products. In this case, we assume each storage node provides 45 GB/s of read performance with 4x200G interfaces.

Here we assume each DGX H100 node requires 8 GB/s of read performance (see Tables 2 and 3, taken from NVIDIA documentation), which for a 32-node compute cluster is 256 GB/s in total and requires 6 storage nodes in this case.
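The storage node count is simple division: the aggregate read target divided by the assumed per-node storage performance, rounded up. The sketch below restates that arithmetic; the 45 GB/s and 4x200G figures are the assumptions stated above, and the 32-node count is implied by the 256 GB/s aggregate, not a separately published value.

```python
import math

# Storage sizing from the read-bandwidth targets above.
READ_PER_DGX_GBPS = 8          # per-node read target (the "Better" tier in Table 3)
NUM_DGX_NODES = 32             # implied by the 256 GB/s aggregate in the text
STORAGE_NODE_READ_GBPS = 45    # assumed per-storage-node read performance
STORAGE_NODE_PORTS = 4         # assumed 4 x 200G interfaces per storage node

aggregate_read = READ_PER_DGX_GBPS * NUM_DGX_NODES                   # 256 GB/s
storage_nodes = math.ceil(aggregate_read / STORAGE_NODE_READ_GBPS)   # 6
storage_ports_200g = storage_nodes * STORAGE_NODE_PORTS              # 24 x 200G

print(f"aggregate read: {aggregate_read} GB/s -> {storage_nodes} storage nodes, "
      f"{storage_ports_200g} x 200G storage ports")
```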

MC-LAG is commonly used to achieve redundancy; here we recommend EVPN Multi-Homing instead, which provides the same redundancy without dedicating switch ports to an MC-LAG peer link.
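The port saving comes from MC-LAG's dedicated peer link between the two switches in a pair, which EVPN Multi-Homing does not need because state is synchronized over BGP EVPN routes across the fabric. The sketch below only illustrates that saving; the 2-port peer-link width is an assumption for illustration, not a vendor requirement.

```python
# Illustrative port comparison: MC-LAG vs EVPN Multi-Homing on one leaf pair.
# Assumption (not a vendor figure): each MC-LAG switch reserves 2 ports for the
# peer link; EVPN Multi-Homing syncs state over BGP EVPN, so no peer link is needed.
LEAF_PAIRS = 1
PEER_LINK_PORTS_PER_SWITCH = 2   # assumed MC-LAG peer-link width

mclag_ports = LEAF_PAIRS * 2 * PEER_LINK_PORTS_PER_SWITCH   # ports consumed by peer links
evpn_mh_ports = 0                                           # no dedicated peer link

print(f"MC-LAG peer-link ports: {mclag_ports}, EVPN-MH peer-link ports: {evpn_mh_ports}")
```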

Performance Level | Work Description | Dataset Size
Good   | Natural Language Processing (NLP) | Datasets generally fit within the local cache
Better | Image processing with compressed images (e.g., ImageNet) | Many to most datasets can fit within the local system's cache
Best   | Training with 1080p, 4K, or uncompressed images; offline inference; ETL | Datasets are too large to fit into cache; massive first-epoch I/O requirements; workflows that read the dataset only once
Table 2. Storage performance requirements

Performance Characteristic | Good (GB/s) | Better (GB/s) | Best (GB/s)
Single-node read  | 4 | 8 | 40
Single-node write | 2 | 4 | 20
Table 3. Single-node storage performance guidelines

In-Band Management Network

The in-band network connects all the compute nodes and management nodes (general-purpose x86 servers), and provides connectivity for in-cluster services as well as access to services outside the cluster.

Out-of-Band Management Network

Asterfusion also offers 1G/10G/25G enterprise switches for the out-of-band management network; these likewise support a Leaf-Spine architecture to manage all the components inside the SuperPOD.