
RoCEv2 AI Solution with NVIDIA DGX SuperPOD

Overview

The NVIDIA DGX SuperPOD™ with NVIDIA DGX™ H100 systems is the cutting-edge data center architecture for AI. Instead of InfiniBand switches, Asterfusion offers an alternative approach based on RoCEv2-enabled 100G-800G Ethernet switches to build a DGX H100 based SuperPOD, covering the compute fabric, storage fabric, and the in-band and out-of-band management networks.

Compute Fabric

For a research project or PoC, starting with a small-scale cluster is a good choice. Up to 4 DGX H100 nodes can be connected to a single 32x400G switch, using QSFP-DD transceivers on the switch ports and OSFP transceivers on the ConnectX-7 NICs of each GPU server, connected with MPO-APC cables.
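As a quick sanity check on the single-switch sizing, the sketch below works out how many DGX H100 nodes one 32x400G switch can host. It is only an illustration of the arithmetic implied above (8 x 400G compute connections per node); it is not a vendor sizing tool.

```python
# Minimal sizing sketch: DGX H100 nodes on a single 32x400G switch.
SWITCH_PORTS_400G = 32   # 32 x 400G QSFP-DD ports on the leaf switch
NICS_PER_NODE = 8        # 8 x 400G ConnectX-7 compute connections per DGX H100

max_nodes = SWITCH_PORTS_400G // NICS_PER_NODE
print(f"Max DGX H100 nodes on one switch: {max_nodes}")  # -> 4
```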

When the number of GPU servers increases to 8, a 2-spine + 4-leaf CLOS fabric can be applied. Each leaf switch connects 2 of the 8 ports of each GPU server, achieving rail-optimized connectivity that increases the efficiency of GPU resources. Spine and leaf switches can be connected through QSFP-DD transceivers with fiber, or with QSFP-DD AOC/DAC/AEC cables.
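Rail-optimized wiring can be expressed as a simple mapping: the NIC port with the same rail index on every server lands on the same leaf, and with 4 leaves each leaf terminates 2 of the 8 rails. The sketch below only illustrates that mapping for the 8-node example; it is not a cabling plan from NVIDIA or Asterfusion.

```python
# Rail-optimized leaf assignment for the 2-spine / 4-leaf, 8-node example.
# Each DGX H100 exposes 8 compute NIC ports (rails 0-7); each of the 4 leaves
# terminates 2 rails from every server, so same-rail ports share a leaf.
NUM_NODES = 8
NUM_RAILS = 8                              # NIC ports per node
NUM_LEAVES = 4
RAILS_PER_LEAF = NUM_RAILS // NUM_LEAVES   # 2

def leaf_for(rail: int) -> int:
    """Leaf switch index that terminates a given rail."""
    return rail // RAILS_PER_LEAF

for node in range(NUM_NODES):
    links = [f"rail{rail}->leaf{leaf_for(rail)}" for rail in range(NUM_RAILS)]
    print(f"dgx{node:02d}: " + ", ".join(links))
```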

A single CX864E-N (a 51.2 Tbps, Marvell Teralynx 10 based 64x800G switch, coming in 2024 Q3) can connect up to 16 GPU servers:

A scalable compute fabric is built from modular blocks, which enables rapid deployment at multiple scales. Each block contains a group of GPU servers and 400G switches (16 x DGX H100 systems with 8 x CX732Q-N in this example). The block is designed to be rail-aligned and non-oversubscribed to ensure the performance of the compute fabric.
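A quick way to see that the block is non-oversubscribed: each of the 8 leaves terminates one rail from each of the 16 servers (16 x 400G down) and presents an equal 16 x 400G of uplink toward the spines. The sketch below restates that arithmetic for a 32-port leaf; it is an illustration, not a vendor-published figure.

```python
# Oversubscription check for one compute block:
# 16 x DGX H100 (8 rails each) on 8 x CX732Q-N (32 x 400G) leaves.
NODES_PER_BLOCK = 16
RAILS_PER_NODE = 8
LEAVES_PER_BLOCK = 8
LEAF_PORTS = 32                # 32 x 400G per leaf

downlinks_per_leaf = NODES_PER_BLOCK * RAILS_PER_NODE // LEAVES_PER_BLOCK  # 16
uplinks_per_leaf = LEAF_PORTS - downlinks_per_leaf                         # 16

ratio = downlinks_per_leaf / uplinks_per_leaf
print(f"downlinks={downlinks_per_leaf}, uplinks={uplinks_per_leaf}, "
      f"oversubscription={ratio:.1f}:1")   # -> 1.0:1, i.e. non-oversubscribed
```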

A typical 64-node compute fabric uses a Leaf-Spine architecture to hold 4 blocks:

To scale the compute fabric further, higher-throughput switches can replace the Spine layer. The CX864E-N can be placed here to expand the network to a maximum of 16 blocks, or 256 nodes:

Table 1 shows the number of switches and cables required for the compute fabric of different scales.

Block Count | Nodes | GPUs  | Spine         | Leaf | Cables
-           | 4     | 32    | -             | 1    | 32
-           | 8     | 64    | 2             | 4    | 128
1           | 16    | 128   | 4             | 8    | 256
2           | 32    | 256   | 8             | 16   | 512
3           | 48    | 384   | 12            | 24   | 768
4           | 64    | 512   | 16            | 32   | 1024
16          | 256   | 2048  | 16 (64x800G)  | 128  | 4096
>16         | >256  | >2048 | Configure according to the actual size, using 64x800G switches throughout
Table 1. Compute fabric component count at different scales
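The counts in Table 1 follow directly from the block geometry above (16 nodes, 8 leaves, and 128 server links per block, with 16 uplinks per leaf). The sketch below reproduces the 1-4 block rows with 32x400G spines and the 16-block row with 64x800G spines; it is a back-of-the-envelope model of the table, not a configuration tool.

```python
import math

# Back-of-the-envelope reproduction of Table 1 (blocks of 16 x DGX H100
# on 8 x 32x400G leaves, 16 x 400G uplinks per leaf).
RAILS_PER_NODE = 8
NODES_PER_BLOCK = 16
LEAVES_PER_BLOCK = 8
UPLINKS_PER_LEAF = 16

def fabric(blocks: int, spine_ports_400g: int) -> dict:
    """Switch and cable counts for a given block count and spine radix
    (32 for a 32x400G spine, 128 for a 64x800G spine in 400G terms)."""
    nodes = blocks * NODES_PER_BLOCK
    leaves = blocks * LEAVES_PER_BLOCK
    uplinks = leaves * UPLINKS_PER_LEAF
    spines = math.ceil(uplinks / spine_ports_400g)
    cables = nodes * RAILS_PER_NODE + uplinks   # server links + leaf-spine links
    return {"nodes": nodes, "gpus": nodes * 8, "spine": spines,
            "leaf": leaves, "cables": cables}

for b in (1, 2, 3, 4):
    print(b, fabric(b, spine_ports_400g=32))    # matches the 400G-spine rows
print(16, fabric(16, spine_ports_400g=128))     # matches the 64x800G-spine row
```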

Storage Fabric

Storage specifications vary across vendors' products. In this case, we assume each storage node provides 45 GB/s of read performance with 4x200G interfaces.

Here we assume each DGX H100 node requires 8 GB/s of read performance (see Tables 2 and 3, taken from NVIDIA documentation), which for a 32-node compute cluster is 256 GB/s in total and requires 6 storage nodes in this case.
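The storage node count is simple division: the aggregate read target divided by the assumed per-node storage performance, rounded up. The sketch below restates that arithmetic; the 45 GB/s and 4x200G figures are the assumptions stated above, and the 32-node count is implied by the 256 GB/s aggregate, not a separately published value.

```python
import math

# Storage sizing from the read-bandwidth targets above.
READ_PER_DGX_GBPS = 8          # per-node read target (the "Better" tier in Table 3)
NUM_DGX_NODES = 32             # implied by the 256 GB/s aggregate in the text
STORAGE_NODE_READ_GBPS = 45    # assumed per-storage-node read performance
STORAGE_NODE_PORTS = 4         # assumed 4 x 200G interfaces per storage node

aggregate_read = READ_PER_DGX_GBPS * NUM_DGX_NODES                   # 256 GB/s
storage_nodes = math.ceil(aggregate_read / STORAGE_NODE_READ_GBPS)   # 6
storage_ports_200g = storage_nodes * STORAGE_NODE_PORTS              # 24 x 200G

print(f"aggregate read: {aggregate_read} GB/s -> {storage_nodes} storage nodes, "
      f"{storage_ports_200g} x 200G storage ports")
```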

MC-LAG is commonly used to achieve redundancy; here we recommend EVPN Multi-Homing instead, which provides the same redundancy without dedicating switch ports to an MC-LAG peer link.
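The port saving comes from MC-LAG's dedicated peer link between the two switches in a pair, which EVPN Multi-Homing does not need because state is synchronized over BGP EVPN routes across the fabric. The sketch below only illustrates that saving; the 2-port peer-link width is an assumption for illustration, not a vendor requirement.

```python
# Illustrative port comparison: MC-LAG vs EVPN Multi-Homing on one leaf pair.
# Assumption (not a vendor figure): each MC-LAG switch reserves 2 ports for the
# peer link; EVPN Multi-Homing syncs state over BGP EVPN, so no peer link is needed.
LEAF_PAIRS = 1
PEER_LINK_PORTS_PER_SWITCH = 2   # assumed MC-LAG peer-link width

mclag_ports = LEAF_PAIRS * 2 * PEER_LINK_PORTS_PER_SWITCH   # ports consumed by peer links
evpn_mh_ports = 0                                           # no dedicated peer link

print(f"MC-LAG peer-link ports: {mclag_ports}, EVPN-MH peer-link ports: {evpn_mh_ports}")
```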

Performance Level | Work Description | Dataset Size
Good   | Natural Language Processing (NLP) | Datasets generally fit within the local cache
Better | Image processing with compressed images (e.g., ImageNet) | Many to most datasets can fit within the local system's cache
Best   | Training with 1080p, 4K, or uncompressed images; offline inference; ETL | Datasets are too large to fit into cache; massive first-epoch I/O requirements; workflows that read the dataset only once
Table 2. Storage performance requirements

Performance Characteristic | Good (GB/s) | Better (GB/s) | Best (GB/s)
Single-node read  | 4 | 8 | 40
Single-node write | 2 | 4 | 20
Table 3. Single-node storage performance guidelines

In-Band Management Network

The in-band network connects all the compute nodes and management nodes (general-purpose x86 servers), and provides connectivity for in-cluster services as well as access to services outside the cluster.

Out-of-Band Management Network

Asterfusion also offers 1G/10G/25G enterprise switches for the out-of-band management network; these likewise support a Leaf-Spine architecture to manage all the components inside the SuperPOD.