Since the emergence of ChatGPT in 2022, AI has dominated discussions. However, behind the scenes, it’s the AI infrastructure that serves as the engine driving today’s large GenAI models. These AI supercomputers process information millions of times faster than standard desktop or server computers. So, how do AI supercomputers train and run large GenAI models?
In this blog post, let’s take a look at what exactly an AI supercomputer is, and how AI supercomputers train and run large AI models such as GPT-3, GPT-4, and even the latest GPT-4o that power ChatGPT and BingChat.

Is HPC equal to AI Super Computers?
How do AI supercomputers relate to HPC? AI supercomputers and High-Performance Computing (HPC) are closely related, often overlapping in capabilities and applications.
AI supercomputers, specialized for AI workloads, may not sound as exciting as quantum computing, but they share parallels with HPC in processing vast amounts of data and performing complex computations at high speed. Both rely on parallel processing for tasks like training large-scale AI models, benefiting from HPC technologies such as powerful processors, GPUs, and high-speed interconnects. This synergy enables AI supercomputers to leverage HPC capabilities, optimizing performance for demanding AI tasks like training deep learning models or image recognition algorithms.
How are GPT models trained?
When it comes to large AI model training, supercomputers sound like an even bigger deal. Supercomputers are the most powerful and advanced HPC systems available. So, what exactly does HPC bring to the table in this process?
The training and inference process matters
Large AI models can be trained on massive datasets using HPC clusters, making AI model training and inference much faster. AI model training is about refining algorithms and adjusting parameters to improve their accuracy. This process involves feeding a large amount of curated data into selected algorithms. This enables the system to refine itself, generating precise responses.
AI model inference, on the other hand, is about applying those trained models to make predictions or decisions in real-time. Training jobs can consume vast amounts of computational resources, including CPU, GPU, or even specialized hardware like TPUs (Tensor Processing Units).
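To make the distinction concrete, here is a minimal sketch in PyTorch with a hypothetical toy model; it is not how GPT itself is implemented, just an illustration of the training-versus-inference split:

```python
import torch
import torch.nn as nn

# A toy model standing in for a real LLM (purely illustrative).
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

def train_step(inputs, targets):
    """Training: forward pass, compute loss, backpropagate, update weights."""
    model.train()
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()      # gradients tell us how to adjust each parameter
    optimizer.step()     # parameters are refined a tiny bit
    return loss.item()

@torch.no_grad()
def infer(inputs):
    """Inference: apply the already-trained model; no gradients, no updates."""
    model.eval()
    return model(inputs).argmax(dim=-1)

# Example usage with random data standing in for a curated dataset.
x, y = torch.randn(32, 128), torch.randint(0, 10, (32,))
print(train_step(x, y))
print(infer(x[:4]))
```

Training repeats the first function billions of times over curated data; inference is the second function, run on demand for every user request.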

Why GPUs became such a big deal
This is why modern AI model training is largely driven by the emergence of GPUs and cloud-scale infrastructure, which in turn require improvements in both specialized software and hardware. AI supercomputers can train models with hundreds of billions of parameters, as seen in the training and inference of large language models such as OpenAI’s GPT models.
For instance, the original GPT-3 had 175 billion parameters and took thousands of GPUs running in parallel, and weeks of wall-clock time, to train on AI supercomputers built by Microsoft and OpenAI. The newer generation, such as GPT-4, is even larger: it is reportedly based on 8 models with 220 billion parameters each, totaling about 1.76 trillion parameters, connected by a Mixture of Experts (MoE), a machine learning technique in which multiple expert networks divide a problem space into homogeneous regions. The method used to train these models is called data parallelism. This involves training multiple copies of the model simultaneously, each on its own small batch of data. After processing each batch, the GPUs exchange their gradients so every copy applies the same update, then move on to the next batch. This necessitates training infrastructure of significant scale to handle the workload efficiently.
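Here is a minimal sketch of that idea, simulating two GPUs with two model replicas on the CPU. In practice this is handled by frameworks such as PyTorch DistributedDataParallel or DeepSpeed; the toy model, learning rate, and batch sizes below are purely illustrative assumptions:

```python
import copy
import torch
import torch.nn as nn

# Two replicas of the same toy model stand in for two GPUs (simulated on CPU).
base = nn.Linear(16, 4)
replicas = [copy.deepcopy(base) for _ in range(2)]
loss_fn = nn.CrossEntropyLoss()

def data_parallel_step(batches):
    # 1. Each "GPU" runs forward/backward on its own slice of the data.
    for replica, (x, y) in zip(replicas, batches):
        replica.zero_grad()
        loss_fn(replica(x), y).backward()
    # 2. All-reduce: gradients are averaged across replicas
    #    (this is the "GPUs share information" step).
    for params in zip(*(r.parameters() for r in replicas)):
        avg_grad = torch.stack([p.grad for p in params]).mean(dim=0)
        for p in params:
            p.grad = avg_grad.clone()
    # 3. Every replica applies the same averaged update, so the copies stay identical.
    for replica in replicas:
        with torch.no_grad():
            for p in replica.parameters():
                p -= 1e-2 * p.grad

# Example: two small batches, one per simulated GPU.
batches = [(torch.randn(8, 16), torch.randint(0, 4, (8,))) for _ in range(2)]
data_parallel_step(batches)
```

The key point is step 2: because every replica ends up with the same averaged gradients, the copies stay in sync while each one only ever sees a fraction of the data.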
Those large AI models leverage self-supervised learning, where they learn by examining billions of pages of information over and over again. They are extremely resource-intensive and can quickly become expensive, and when I say expensive, that can mean both time and money.
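Concretely, “self-supervised” means the labels come from the text itself: the model predicts the next token, so the targets are simply the input sequence shifted by one position. A minimal sketch, using a hypothetical toy vocabulary and a stand-in model rather than a real transformer:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 64
# Toy "language model": embedding + linear head (a real LLM uses transformer layers).
embed = nn.Embedding(vocab_size, d_model)
head = nn.Linear(d_model, vocab_size)

tokens = torch.randint(0, vocab_size, (2, 32))   # a batch of token sequences
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # targets = inputs shifted by one

logits = head(embed(inputs))                     # predicted distribution over the next token
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1)
)
print(loss.item())  # no human labels needed: the text supervises itself
```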
How long did it take to train the GPT models?
OpenAI released GPT-3 in June 2020. By one widely cited estimate, training GPT-3 on a single NVIDIA Tesla V100 GPU would take an astonishing 355 years and cost around $4.6 million.
However, Microsoft built a powerful supercomputer in its Azure data centers, equipped with over 285,000 CPU cores, 10,000 GPUs, and high-speed network connections for each GPU server. This supercomputer, coupled with software technologies like DeepSpeed and ONNX Runtime, enables distributed deep learning training and inference. OpenAI uses it to apply techniques like data parallelism, dividing the training data into smaller batches that are processed in parallel across multiple GPUs while staying within each GPU’s memory limits. With this kind of infrastructure, GPT-3, with its 175 billion parameters trained on 300 billion tokens, could theoretically have been trained in just 34 days using 1,024 A100 GPUs.
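These numbers can be roughly sanity-checked with the common back-of-envelope rule of about 6 FLOPs per parameter per training token. The sustained-throughput figures below are assumptions, not measured values:

```python
# Back-of-envelope training cost estimate (~6 FLOPs per parameter per token).
params = 175e9           # GPT-3 parameters
tokens = 300e9           # training tokens
total_flops = 6 * params * tokens              # ~3.15e23 FLOPs

v100_sustained = 28e12   # assumed sustained throughput of one V100, FLOP/s
years_single_v100 = total_flops / v100_sustained / (3600 * 24 * 365)
print(f"Single V100: ~{years_single_v100:.0f} years")    # ~356 years

a100_sustained = 312e12 * 0.35   # assumed ~35% utilization of an A100's peak
days_1024_a100 = total_flops / (1024 * a100_sustained) / (3600 * 24)
print(f"1,024 A100s: ~{days_1024_a100:.0f} days")        # ~33 days
```

The gap between the two estimates is exactly what the Azure supercomputer buys: thousands of faster GPUs working in parallel instead of one.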
But the story doesn’t end there. GPT-4, introduced in March 2023, brings multimodality into the mix. This enhancement allows GPT-4 to process image and text inputs, alongside improved reasoning capabilities and a deeper understanding of the context of user prompts. GPT-4 was reportedly trained on approximately 25,000 NVIDIA A100 GPUs for 90 to 100 days, resulting in a development period of about 5 to 6 months.
Learn NLP with transformers

Natural Language Processing with Transformers, Revised Edition
Since their introduction in 2017, transformers have quickly become the dominant architecture for achieving state-of-the-art results on various natural language processing tasks. If you’re a data scientist or coder, this practical book, now revised in full color, shows you how to train and scale these large models using Hugging Face Transformers, a Python-based deep learning library.
In this guide, authors Lewis Tunstall, Leandro von Werra, and Thomas Wolf, among the creators of Hugging Face Transformers, use a hands-on approach to teach you how transformers work and how to integrate them into your applications. You’ll quickly learn a variety of tasks they can help you solve.
The key challenge of AI Super Computers
The challenges for AI supercomputers go beyond simply making them faster. On the hardware side, you need very high-bandwidth networking, such as InfiniBand, that delivers high throughput with minimal latency, and power consumption and heat dissipation are also on the table. On the software side, you need frameworks such as DeepSpeed, developed by Microsoft Research, to run large-scale deep learning training and inference efficiently and effectively.
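For a flavor of the software side, here is a minimal sketch of wiring a model into DeepSpeed. The toy model and config values are hypothetical placeholders, and a real job would be launched with the deepspeed launcher across many GPUs and nodes:

```python
import torch.nn as nn
import deepspeed

# Hypothetical model and config values, shown only to illustrate the wiring;
# real configs are tuned per cluster and per model.
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},   # shard optimizer state and gradients across GPUs
}

# deepspeed.initialize wraps the model in a distributed training engine.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

# The training loop then uses the engine's backward/step instead of plain PyTorch:
#   loss = loss_fn(engine(x), y)
#   engine.backward(loss)
#   engine.step()
```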
Things may look perfect on paper, but what really matters in practice is handling the failures that regularly occur when running intensive workloads at such a large scale, such as servers going down or network connections becoming unstable. It’s important to reduce the frequency of these failures and, when they do happen, to quickly figure out what went wrong and fix it.
The savior of AI Super Computing
But, you know what? Today, we have the technology to tackle this challenge. The many GPUs required to train LLMs don’t sit on the same board, or even in the same rack, in the data center. They’re spread out to manage power consumption and heat efficiently. This setup ensures scalability, enabling networks and clusters to grow to thousands of units as demand increases.
You might ask, “So why should we care about this?” In reality, training large AI models can take days, weeks, or even months. What if something goes wrong midway? That’s why it’s crucial to periodically save the AI model’s state by creating checkpoints in an incremental fashion. If anything fails, training can quickly resume from the most recent checkpoint instead of starting from scratch. This is where Project Forge from Microsoft comes in.
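Before looking at what Project Forge adds on top, the core checkpoint-and-resume idea looks roughly like this minimal PyTorch sketch, with a toy model and a hypothetical file path rather than Project Forge’s actual mechanism:

```python
import os
import torch
import torch.nn as nn

model = nn.Linear(128, 10)                       # toy stand-in for a large model
optimizer = torch.optim.AdamW(model.parameters())
ckpt_path = "checkpoint.pt"                      # hypothetical path

def save_checkpoint(step):
    # Periodically persist everything needed to resume: weights, optimizer state, progress.
    torch.save({
        "step": step,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
    }, ckpt_path)

def resume():
    # On failure (or migration), reload the latest checkpoint instead of restarting from scratch.
    if not os.path.exists(ckpt_path):
        return 0
    ckpt = torch.load(ckpt_path)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"] + 1

start_step = resume()
for step in range(start_step, start_step + 1000):
    # ... one training step would go here ...
    if step % 100 == 0:
        save_checkpoint(step)
```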
Project Forge, along with the global GPU scheduler, enables you to pause a job and then smoothly migrate it to another cloud region with minimal loss of progress. The same technology also load-balances jobs on a global scale: jobs are assigned to virtual clusters that map onto the physical resources in the data center. BingChat, GitHub Copilot, and OpenAI ChatGPT all rely on this technology to keep training and inference jobs running continuously.
Looking forward
If you’re interested in similar topics, please feel free to follow me here on my newsletter and join me as I write about AI infrastructure. You can also subscribe to my YouTube channel. Let me know your thoughts in the comment section. Stay tuned, and see you in the next one!