New technology shows the synergies that combining CPU, DPU, and GPU can enable in AI, especially in very large AI.
After steering a 25% rise in NVIDIA shares last week, CEO Jensen Huang flew to Computex in Taipei to announce a slew of new products demonstrating how his company intends to continue to lead as it approaches the $1 trillion market-cap milestone. While these products alone won’t take the company to $2T, they indicate just how important generative AI will be to get there.
What Did NVIDIA Announce?
Major NVIDIA announcements are typically reserved for the annual GTC event in Silicon Valley. But this year is different; this is the year AI graduates from cool technology to a must-have solution, possessing nearly human intelligence and infinite knowledge, that every Global 500 company on earth will need to master.
While NVIDIA announced products that span from AI supercomputers to gaming characters that can hold a conversation, we will focus here on the three technologies that bear directly on this AI moment: the Grace Hopper-based 256-GPU DGX GH200, the MGX platform for system builders, and the new Spectrum-X Ethernet networking that ties it all together.
The GH200
NVIDIA CEO Jensen Huang has been telling us for years that NVIDIA is in the business of selling optimized data centers, not chips and components. You probably missed the boat if you mistakenly took that as hyperbole. NVIDIA has been selling DGX SuperPODs since the A100, but while the scale of these systems could be quite large, the NVLink-enabled optimization (shared memory) was limited to eight GPUs: a programmer could treat those eight GPUs as one large GPU sharing a single memory pool, as the sketch below illustrates.
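To make that programming model concrete, here is a minimal sketch in PyTorch (my choice of framework for illustration, not anything NVIDIA prescribes) of a single logical tensor sharded across every visible GPU. It assumes a multi-GPU, NVLink-connected machine such as a DGX; `torch.cuda.can_device_access_peer` reports whether one GPU can read another’s memory directly.

```python
import torch

# Sketch: treat N GPUs as one large GPU by sharding a single logical
# tensor across all of them. Assumes a multi-GPU machine; on an
# NVLink-connected system the cross-GPU traffic below rides NVLink
# rather than PCIe, which is what makes this illusion practical.
n = torch.cuda.device_count()

# Peer access is what lets GPU i touch GPU j's memory directly.
for j in range(1, n):
    print(f"GPU 0 <-> GPU {j} peer access:",
          torch.cuda.can_device_access_peer(0, j))

# One logical vector stored as n shards, one per GPU.
shards = [torch.ones(1 << 20, device=f"cuda:{i}") for i in range(n)]

# Reduce over the logical vector: each partial sum runs where its
# shard lives; only the scalars travel to GPU 0 to be combined.
total = sum(s.sum().to("cuda:0") for s in shards)
print(total.item())  # n * 2**20
```

The same code runs on PCIe-only boxes, just slower; NVLink bandwidth is what makes the sharded pool fast enough to treat as one memory.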
But the definition of “large” changed dramatically when GPT-4 hit the silicon streets; it is rumored to contain a trillion parameters. You need a massive amount of shared, fast (HBM) memory to train these huge AI models as well as to run inference queries. Customers like Google, Meta, and Microsoft/OpenAI need a far larger footprint.
Now, Jensen has announced the DGX GH200 massive-memory supercomputer for generative AI, powered by Grace Hopper Superchips and NVLink to train large (there’s that word again) AI models and drive AI innovation forward. What DGX did for smaller AI, the GH200 will do for builders of massive AI. The GH200 is interconnected with NVLink to provide 1 exaflop of AI (low-precision) performance and 144 terabytes of shared memory, nearly 500x more than the previous-generation NVIDIA DGX A100 introduced in 2020. The NVLink-C2C interconnect inside each superchip increases the bandwidth between GPU and CPU by 7x compared with the latest PCIe technology, slashes interconnect power consumption by more than 5x, and provides a 600GB Hopper-architecture GPU building block for DGX GH200 supercomputers.
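As a sanity check on those headline numbers, here is a back-of-the-envelope calculation. It assumes per-superchip figures that are not in the article: roughly 480GB of LPDDR5X on the Grace CPU plus 96GB of HBM3 on the Hopper GPU, compared against the 2020 DGX A100’s 8 x 40GB of GPU memory.

```python
# Back-of-the-envelope check of the DGX GH200 memory claims.
# Assumed figures (not from the article): 480GB LPDDR5X per Grace CPU,
# 96GB HBM3 per Hopper GPU, and 8 x 40GB HBM in the 2020 DGX A100.
SUPERCHIPS = 256
LPDDR5X_GB = 480          # Grace CPU memory per superchip (assumption)
HBM3_GB = 96              # Hopper GPU memory per superchip (assumption)
DGX_A100_GB = 8 * 40      # GPU memory in the original DGX A100

pool_gb = SUPERCHIPS * (LPDDR5X_GB + HBM3_GB)
print(f"shared pool: {pool_gb:,} GB ~= {pool_gb / 1024:.0f} TB")  # ~144 TB
print(f"vs DGX A100: {pool_gb / DGX_A100_GB:.0f}x")               # ~461x
```

Under those assumptions, the totals land right on the 144TB figure and within rounding distance of the “nearly 500x” claim.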
Speaking of hyperscalers, NVIDIA mentioned three: Google, Meta, and Microsoft, all of which voiced support in NVIDIA’s press materials with great quotes. Consequently, we would be shocked if these AI leaders don’t deploy multiple GH200 systems. Amazon AWS was notably MIA. We are convinced that AWS intends to go it alone, favoring its own AI chips: Inferentia and Trainium. Good luck with that; those chips simply aren’t competitive with NVIDIA’s. Nor is AWS networking competitive with NVLink, not to mention the new NVIDIA Ethernet technology also announced at Computex.
As is usually the case, NVIDIA is its own best customer for the GH200. NVIDIA announced it is building the Helios supercomputer to advance its own AI research and development. This private supercomputer, akin to Selene, will feature four DGX GH200 systems interconnected with NVIDIA Quantum-2 InfiniBand networking to accelerate data throughput for training large AI models. The 1,024-Grace-Hopper-Superchip system is expected to come online by the end of the year.
MGX
NVIDIA must also enable its partners, who play a pivotal role in extending NVIDIA’s market reach into first- and second-tier CSPs. For system vendors like HPE, Dell, Lenovo, and Penguin, NVIDIA created the HGX reference board to enable 8-way GPU connectivity. For more flexibility to mix and match technologies, Jensen announced MGX, which enables over 100 unique configurations of NVIDIA CPU, GPU, and DPU components to meet the individual needs of each partner’s customers. NVIDIA announced six partners, normally associated with building hyperscale infrastructure, as the first to sign up for MGX.
Spectrum-X
NVIDIA acquired InfiniBand technology when it bought Mellanox three years ago. InfiniBand is great for supercomputers because, in part, it is “lossless”, unlike Ethernet, which recovers from lost packets by trying again. And again. And again, until the missing network packet finally arrives at its destination. That’s fine for cloud services, but not for HPC. And, it appears, not for large-scale AI.
Massive AI needs the performance and lossless packet delivery of InfiniBand but prefers the lower cost and ubiquity of Ethernet networking to run its data centers. Ethernet’s effective bandwidth fluctuates considerably because the TCP/IP protocol tolerates frequent packet drops, recovering through retransmission. And that’s just not OK for big AI.
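To see why those drops hurt so much at AI scale, consider a toy simulation (my own illustration, with an assumed 100x timeout penalty per drop, not an NVIDIA model): a collective operation such as an all-reduce cannot complete until its slowest flow completes, so even a small loss rate inflates every step’s tail latency.

```python
import random

# Toy model: an all-reduce across many flows finishes only when its
# SLOWEST flow finishes. Each flow sends `packets` packets; a drop is
# recovered by retransmission after a timeout that costs far more
# than a normal packet time (RTO_PENALTY is an assumed 100x).
RTO_PENALTY = 100

def flow_time(packets, loss, rng):
    t = 0
    for _ in range(packets):
        t += 1                       # normal send: one time unit
        while rng.random() < loss:   # each drop stalls for a timeout
            t += RTO_PENALTY
    return t

def collective_time(n_flows, packets, loss, rng):
    return max(flow_time(packets, loss, rng) for _ in range(n_flows))

rng = random.Random(0)
base  = collective_time(64, 10_000, 0.000, rng)
lossy = collective_time(64, 10_000, 0.001, rng)
print(f"lossless: {base}  0.1% loss: {lossy}  ({lossy / base:.2f}x)")
```

Even a 0.1% drop rate costs a double-digit percentage slowdown on every collective, and because training repeats thousands of such steps, the penalty compounds across a run.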
NVIDIA’s solution is to provide these customers with Spectrum-X, a combination of the new Spectrum-4 Ethernet switch and the high-performance BlueField-3 DPU. The combination of the Spectrum-4 switch, the BlueField-3 NIC, and the NVIDIA networking software stack achieves 1.7x better overall AI performance and power efficiency, along with the consistency afforded by a lossless network. That’s right: NVIDIA is promising an Ethernet network that does not drop packets. Sounds like magic to me, but this is just what enterprise and hyperscale data centers have been asking for, for decades.
Conclusions
NVIDIA has taken a holistic approach to training and running large AI models, laying out a reference architecture offering a 256-GPU building block that users of large AI will embrace. Combine that with the recent Dell announcement, supercomputing traction for Grace, Microsoft Azure support for NVIDIA AI Enterprise, the new MGX reference architecture, NVLink, and the Spectrum-X lossless Ethernet solution, and anyone running serious AI jobs has only one logical choice: NVIDIA.
Could this change? Yes. Competition will always nibble at NVIDIA’s heels. AMD has a serious entry, the MI300, coming later this year. RISC-V solutions like Tenstorrent and Esperanto are getting attention and traction, but not in the market where Jensen is focused: massive foundation models. Intel could pull a rabbit out of the hat with Gaudi3 and/or the forever-late Ponte Vecchio GPU.
But as I have said in the past, considering NVIDIA’s superior hardware combined with the depth and breadth of NVIDIA’s software to optimize AI and HPC applications, all competitors combined could maybe get 10% of the market. In a $75B market, that could be plenty to float some more boats.
But NVIDIA is the only trillion-dollar player and we don’t see that changing.