In my first blog of this series, Introduction to AI Infrastructure, I introduced the components that make up the infrastructure for AI—compute, storage, network, software, and others. Before you can build an infrastructure to support AI workloads, however, you need to understand the language. Tech language has always been littered with acronyms, but AI introduces some you may not yet be familiar with.
Breaking it Down
For this article, let’s concentrate on compute. For AI workloads, GPUs are the workhorse elements of the infrastructure. As I wrote in the previous blog, “GPUs are designed to handle the parallel processing required for tasks like training neural networks and AI inferencing. NVIDIA’s DGX systems are industry leaders, offering unparalleled performance for AI workloads.”
When you dig into the details of NVIDIA’s systems, it’s helpful to know a bit about their products. What do the acronyms DGX, HGX, MGX, and others stand for, and what is the difference between them?
- DGX
DGX is NVIDIA’s flagship platform for enterprise AI. It provides a full-stack environment, integrating hardware and software specifically designed for AI development. The platform offers the high-performance computing power and scalability needed for training AI models and supporting AI use cases such as research, medical diagnostics, and fraud detection. DGX is sold exclusively as an NVIDIA-branded system.
- HGX
NVIDIA’s HGX platform is designed for hyperscale environments, such as data centers and large cloud providers. It allows multiple GPUs to work in tandem, employs high-bandwidth interconnect technologies between GPUs, and offers a scalable, modular architecture so organizations can build customized environments for complex AI workloads that demand high-performance computing power. HGX is well suited for training large neural networks, scientific modeling, and data analytics, among other use cases. HGX baseboards use the same SXM-based GPUs (described later in this blog) found in DGX, but they are integrated into other OEMs’ popular compute product lines. For instance, you can get HGX in HPE ProLiant, HPE Apollo, Dell PowerEdge, Lenovo ThinkSystem, Supermicro SuperServer, and others. The platforms listed next (MGX, OVX, EGX, and IGX) are also available from non-NVIDIA OEMs.
- MGX
The emphasis of MGX is modularity. This platform makes it easier for organizations to build high-performance computing systems customized for specific AI workloads. With the MGX platform, engineers can select different NVIDIA components (e.g., GPUs, CPUs, DPUs, and networking) and configure them into systems that give specific workloads the power they need for optimized performance. It can accommodate accelerators with varying interface types, including SXM, PCIe, and NVL.
- OVX
When you see this acronym, think complex 3D virtual environments like digital twins and the metaverse. OVX provides the infrastructure and power behind creating and operating realistic virtual worlds and is well suited for design, engineering, and entertainment applications that require large-scale, real-time simulation and rendering of virtual environments.
- EGX
The ‘E’ here stands for edge. NVIDIA’s EGX platform pushes AI to the edge of your network, enabling data to be processed closer to where it is created. This allows real-time processing and reduces the need to transport data to a central data center. Industry use cases vary, but some, like autonomous driving, smart cities, and robotics, cannot function well (if at all) without real-time analysis and decision-making at the edge.
- IGX
This platform is similar to EGX in that it focuses on enabling AI at the edge, but it takes safety and security to another level. IGX targets AI operating environments where real-time processing is required and precision, reliability, and safety cannot be compromised. These include many industrial IoT and autonomous computing use cases, as well as regulated systems in healthcare and other industries. If you have an edge environment that is safety-critical or where compliance with strict regulations is a must, then IGX is the platform to consider.
- Grace Superchip
NVIDIA announced its Grace CPU in 2021. The Grace Superchip, built on Arm Neoverse V2 processor cores, is specifically targeted to deliver exceptionally high computing power with high memory capacity. It is ideal for training large AI models, performing scientific computations, and conducting data analytics at large scale.
- Grace Blackwell Superchip
Grace Blackwell (GB) is another superchip, formed from two Blackwell accelerators and one Grace processor connected on a single module via the NVLink-C2C interconnect. It was announced in March 2024 at NVIDIA’s annual GTC conference. It is available in NVL72, a new NVIDIA rack-scale server architecture consisting of 18 servers with two GB superchips each, giving a combined 72 GPUs to 36 CPUs per rack. It is suitable for extremely large-scale training and inference use cases.
- NVIDIA-Certified and Qualified Systems
If you’ve been researching AI infrastructure, you’ve likely come across this term (if you haven’t yet, you will). A hardware solution designated as an “NVIDIA-Certified and qualified system” has been tested by NVIDIA engineers against a rigorous set of performance and reliability specifications, and it performed well. As a result, NVIDIA deems the solution acceptable and stamps its approval on it.
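The GB200 NVL72 topology described above is easy to sanity-check with quick arithmetic. Here is a minimal sketch in Python; the tray and superchip counts come from the rack description above, and the variable names are illustrative:

```python
# Sanity-check the NVL72 GPU:CPU ratio described above.
# Per the rack description: 18 servers (trays), each holding two
# Grace Blackwell superchips; each superchip pairs two Blackwell
# GPUs with one Grace CPU.

TRAYS_PER_RACK = 18
SUPERCHIPS_PER_TRAY = 2
GPUS_PER_SUPERCHIP = 2
CPUS_PER_SUPERCHIP = 1

superchips = TRAYS_PER_RACK * SUPERCHIPS_PER_TRAY   # 36 superchips
gpus = superchips * GPUS_PER_SUPERCHIP              # 72 Blackwell GPUs
cpus = superchips * CPUS_PER_SUPERCHIP              # 36 Grace CPUs

print(f"NVL72 rack: {gpus} GPUs : {cpus} CPUs")     # 72 GPUs : 36 CPUs
```

The 72:36 ratio falls directly out of the two-GPU, one-CPU composition of each superchip.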
Other Terms to Know
As you explore the solution that is best for your organization, you will encounter a few more terms worth noting.
- SXM
This acronym stands for Server PCI Express Module. It is a form factor and interface designed to deliver high data transfer speed between GPUs while optimizing efficiency. Compared to PCIe-based interfaces, SXM performs better, delivering higher bandwidth and lower latency with better thermal and power management.
- NVLink
This is NVIDIA’s high-bandwidth GPU-to-GPU interconnect technology; the related NVLink Switch System extends it so that many GPUs can communicate at full speed. It offers extremely fast data transfer between GPUs and other components, with some generations delivering up to 900 GB/s per GPU.
- BasePOD
NVIDIA’s BasePOD is a reference architecture. It integrates all the necessary computing elements—GPUs, software, and networking—into a foundational building block for AI infrastructure that scales easily. BasePODs offer a choice of networking technology (Ethernet or InfiniBand).
- SuperPOD
A SuperPOD is a scaled-out version of BasePOD, designed to support AI computing at massive scale. It grows in units called Scalable Units (SUs); an SU is either 20 or 32 nodes, depending on the accelerator generation (Ampere, Hopper, or Blackwell). It is ideal for demanding AI workloads that require supercomputing capability, such as scientific computing, training complex AI models, and other high-performance computing use cases. SuperPODs mandate the use of a specialized networking technology: InfiniBand.
- GPU Architecture
NVIDIA has a wide range of accelerators, and they are based on different architectures. The lineage started with Pascal, found in DGX-1, the first DGX server. Then came Volta, Ampere, Hopper, Blackwell, and Grace Blackwell (GB). Except for GB, all are available in SXM and PCIe form factors. Each new architecture brought with it a newer PCIe generation and increased speeds and feeds. These days, during the AI gold rush, hearing about accelerators like the A100, H200, and B100 is common, and it helps to know that each model name maps to one of these architectures.
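The model-to-architecture mapping above can be captured in a small lookup table. This is an illustrative sketch, not an exhaustive product list; the helper name is my own, and only the architecture families named above are included:

```python
# Map NVIDIA accelerator model names to their GPU architecture
# families, per the lineage described above. Illustrative subset only.

ARCHITECTURES = {
    "GB": "Grace Blackwell",  # e.g., GB200 superchip
    "P": "Pascal",            # e.g., P100 (DGX-1)
    "V": "Volta",             # e.g., V100
    "A": "Ampere",            # e.g., A100
    "H": "Hopper",            # e.g., H100, H200
    "B": "Blackwell",         # e.g., B100, B200
}

def architecture_of(model: str) -> str:
    """Return the architecture family for a model name like 'H200'."""
    # Try longer prefixes first so 'GB200' matches 'GB', not 'B'.
    for prefix in sorted(ARCHITECTURES, key=len, reverse=True):
        if model.upper().startswith(prefix):
            return ARCHITECTURES[prefix]
    raise ValueError(f"Unknown model: {model}")

print(architecture_of("A100"))   # Ampere
print(architecture_of("H200"))   # Hopper
print(architecture_of("GB200"))  # Grace Blackwell
```

Checking the two-letter GB prefix before the single-letter ones matters; otherwise a GB200 would be misread as a plain Blackwell part.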
Choosing Your Infrastructure Solution
Selecting the right AI infrastructure is not easy, yet it is an important step that can affect the ultimate success of your AI projects. Many factors must be considered, beginning with a clear definition of use cases and outcomes and a thorough understanding of AI workloads.
Look for the next post in this series where we will dig deeper into the network component of AI infrastructure. As always, our goal is to offer insight and guidance that will help you move forward with building an optimized environment that will deliver on your AI strategy.
For more help with any stage of your AI journey, ePlus offers a comprehensive set of services. Check out ePlus AI Ignite for more information.