
The $100 Billion AI Supercomputer Project – Stargate, OpenAI Favors Ethernet Instead of InfiniBand?

written by Asterfusion

April 2, 2024

According to a new report from The Information, Microsoft and OpenAI are building a massive data center called “Stargate” for an AI supercomputer. The project could cost more than $100 billion. Microsoft uses InfiniBand cabling in its current project, while OpenAI reportedly favors Ethernet over InfiniBand (IB) for its network infrastructure.

Self-developed White Box Switches or IB Switches?

Today, it seems easier than ever to self-develop high-end Ethernet switches as IB alternatives.

Instead of relying on the large chassis and backplanes of a few communications vendors, we can now use a Clos architecture to remove the performance limits of switch backplanes and build high-performance networks from common hardware designs. In addition, in the era of “software-defined everything”, the standard SAI (Switch Abstraction Interface) and open-source SONiC offer more flexible, customizable, high-performance network options. That may be why a number of large enterprises, such as ByteDance and Alibaba, have begun developing their own switches based on white boxes.

Nonetheless, for the vast majority of organizations interested in building their own AI or HPC clusters, the time cost of developing their own switches is much higher than that of purchasing switch products. For them, a self-developed white box switch is not a reasonable option.
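To make the Clos point above concrete, here is a minimal Python sketch of how many host ports a non-blocking two-tier leaf-spine (folded Clos) fabric can offer when it is built from identical fixed-port switches. The function name `two_tier_clos`, the 1:1 oversubscription ratio and the one-link-per-leaf-spine-pair wiring are assumptions made for illustration; they do not come from the report or from any vendor design.

```python
def two_tier_clos(radix: int) -> dict:
    """Size a non-blocking two-tier leaf-spine fabric built from
    identical switches that each have `radix` ports (hypothetical helper)."""
    if radix < 2 or radix % 2:
        raise ValueError("radix must be a positive even number")
    down = radix // 2   # leaf ports facing hosts
    up = radix - down   # leaf ports facing spines (1:1 oversubscription)
    spines = up         # assumption: one uplink from each leaf to each spine
    leaves = radix      # each spine port connects to exactly one leaf
    return {
        "leaves": leaves,
        "spines": spines,
        "hosts": leaves * down,
        "switches": leaves + spines,
    }


if __name__ == "__main__":
    for radix in (32, 64):
        print(radix, two_tier_clos(radix))
```

Under these assumptions, 32-port switches yield 32 leaves, 16 spines and 512 host ports, and capacity grows with the square of the port count rather than with the size of a single chassis backplane.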

Whitebox Switch Vendors or Solution Providers for AI, HPC and Cloud?

Companies like OpenAI and Microsoft are not the only ones considering Ethernet infrastructure. Because of the problems that come with the high concentration of vendors supplying InfiniBand switches (long lead times, high prices, etc.), many Tier 2 and Tier 3 CSPs and some AI-related companies have long been considering RoCE (RDMA over Converged Ethernet) as an alternative to InfiniBand.

Today’s white box switch vendors include Pica8, IP Infusion, Agema Systems, Foxconn Technology, Edgecore Networks, Celestica, Asterfusion Data Technologies and others.

Among these, Asterfusion has been building its enterprise SONiC operating system (AsterNOS) since 2017 to provide a turnkey open network solution, and it is one of the few vendors to offer both an enterprise-ready SONiC distribution and whitebox switch hardware (according to a recent Gartner report).

100G-800G Whitebox Switches and Turnkey SONiC Solution by Asterfusion

Asterfusion provides network switches with port speeds ranging from 100G to 800G for Cloud and AI.

Whether tested in AIGC networks, HPC or distributed storage networks, Asterfusion’s RoCEv2 Ethernet switches meet or exceed the performance of InfiniBand switches at half the price.

For the AI computing clusters required for LLM training, cutting unnecessary cross-GPU server links without sacrificing performance greatly reduces the user’s network construction costs. With only one layer of rail switches and ultra-low packet forwarding latency (~400 ns), we can minimize the AIGC/ML communication overhead. If you are interested in a detailed solution and would like to receive on-site test results, please feel free to contact us by e-mail (bd@cloudswit.ch).
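As a rough illustration of the single-layer rail design described above, the following Python sketch counts Ethernet switch hops between two GPUs and multiplies them by the quoted ~400 ns forwarding latency. It is a simplified model, not Asterfusion’s implementation: the names `rail_of`, `switch_hops` and `forwarding_latency_ns` are hypothetical, and it assumes that GPU i of every server attaches to rail switch i and that cross-rail traffic first moves to the matching GPU inside the source server over the intra-server interconnect. Cable, NIC and queuing delays are ignored.

```python
from typing import Tuple

Endpoint = Tuple[int, int]  # (server index, GPU index within that server)

FORWARDING_NS = 400  # approximate per-switch forwarding latency quoted in the text


def rail_of(gpu_index: int) -> int:
    """Assumed rail-optimized wiring: GPU i of every server uses rail switch i."""
    return gpu_index


def switch_hops(src: Endpoint, dst: Endpoint) -> int:
    """Ethernet switch hops between two GPUs with one layer of rail switches."""
    if src[0] == dst[0]:
        return 0  # same server: traffic stays on the intra-server interconnect
    # Different servers: the data leaves through the NIC on the destination
    # GPU's rail, so it crosses exactly one rail switch.
    return 1


def forwarding_latency_ns(hops: int) -> int:
    """Switch forwarding latency only; cables, NICs and queuing are ignored."""
    return hops * FORWARDING_NS


if __name__ == "__main__":
    src, dst = (0, 3), (5, 3)
    hops = switch_hops(src, dst)
    print("rail switch:", rail_of(dst[1]))
    print("hops:", hops, "forwarding latency (ns):", forwarding_latency_ns(hops))
```

For comparison, a leaf-spine-leaf path in a two-tier fabric crosses three switches, so under the same per-switch figure `forwarding_latency_ns(3)` gives 1200 ns against 400 ns for the single rail hop.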
