Is AMD narrowing the AI gap on Nvidia?

MosaicML put the AMD MI250 against the Nvidia A100 and had both train large language models of different sizes (Photo: Jimmy Tudeschi / Shutterstock)

AMD-built artificial intelligence chips are “almost” as fast as the industry-leading devices from Nvidia. That is according to a new study by Databricks-owned AI software company MosaicML, which found AMD’s technology achieved 80% of Nvidia’s performance when training large language models and performing other AI-intensive tasks.

Nvidia currently dominates the market when it comes to training AI models such as those used to run ChatGPT or Midjourney. The success of these products and demand for compute power has pushed Nvidia to a $1trn valuation and sparked a shortage of GPUs.

MosaicML recently put AMD’s MI250 GPUs to the test against Nvidia’s A100s. Both devices, which are one generation behind their respective makers’ top-of-the-range chips, were used to train large language models. Researchers found that the AMD and Nvidia chips both worked “out of the box” in training the models, and that AMD delivered about 80% of Nvidia’s performance.

The team trained models ranging from one billion to 13 billion parameters, similar to those being used in enterprise to provide AI-driven tools for search and summary of large company datasets. Training ran on a single node of four GPUs, and the team found the throughput of the MI250 was about 80% of the A100’s. The MI250 had a slight edge in terms of floating-point operations per second and memory, which according to MosaicML allows for larger models per GPU.
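To put the 80% figure in practical terms, the short Python sketch below converts a throughput ratio into projected training time for the same workload. The function names and the tokens-per-second and hours values are hypothetical and chosen only for illustration; they are not MosaicML's measurements.

```python
def relative_throughput(candidate_tps: float, baseline_tps: float) -> float:
    """Return the candidate's throughput as a fraction of the baseline's."""
    if baseline_tps <= 0:
        raise ValueError("baseline throughput must be positive")
    return candidate_tps / baseline_tps


def projected_training_hours(baseline_hours: float, ratio: float) -> float:
    """Estimate wall-clock hours for the same token budget at a given ratio."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return baseline_hours / ratio


# Hypothetical figures for illustration only, not benchmark results.
baseline_tps = 10_000.0   # assumed tokens per second on the faster chip
candidate_tps = 8_000.0   # assumed tokens per second on the slower chip

ratio = relative_throughput(candidate_tps, baseline_tps)
hours = projected_training_hours(100.0, ratio)

print(f"relative throughput: {ratio:.0%}")
print(f"projected hours for a 100-hour baseline run: {hours:.1f}")
```

Under these assumed numbers, a chip running at 80% of the baseline throughput needs about a quarter longer to finish the same job, since time scales with the inverse of throughput.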

The company plans to profile larger models on larger clusters of GPUs to confirm whether the AMD systems can perform at scale, and is doing so in partnership with hyperscalers. There are also plans to create inference benchmarks and to run other models, such as diffusion models, on both systems to test a wider range of options.

While the chips weren’t the top-tier products from each company, both are widely used in datacentres and in training AI models. MosaicML says new ML training hardware is necessary to “increase compute availability amid the Nvidia supply crunch”.

AMD driven by software

MosaicML says the AMD performance was related to a new version of the vendor’s software that was released last year and interacts with open-source AI software PyTorch. Hanlin Tang, MosaicML CTO, says further software updates from AMD for the MI250 will allow it to match the performance of the Nvidia A100 by the end of the year.

He said that AMD had done particularly well in software, allowing it to keep pace with and catch up to Nvidia despite differences in hardware performance. Tang says it’s possible to switch to AMD without requiring changes to code bases or rewriting the large language model, adding that he believes “they’re essentially interchangeable”.

Tang said AMD did not pay MosaicML to conduct the research. His company produces software designed to make it easier for enterprises to create AI models and train them in-house rather than rely on tools from OpenAI or other large AI labs. He said the research was intended to show there are choices beyond Nvidia.

“Overall, we are incredibly optimistic about the future market for AI training hardware,” he said. “More good options means more compute supply, more market pressure on prices, and ultimately lower costs for users who want to train their own models.”

Databricks revealed last week that it had paid $1.3bn for MosaicML as part of a wider effort to build an ecosystem of enterprise-ready open-source AI models. Both companies produce tools that make AI algorithms smaller and cheaper to run on large datasets, but the MosaicML software will be used to enhance Databricks’ offering.

The report comes after Intel set out its long-term plans to compete on AI chips from 2025. It is shifting its strategy to focus on building products that go up against hardware from Nvidia and AMD.

Last week Intel announced its Falcon Shores chip will have 288GB of memory and support 8-bit floating point computation, which is important for training AI models. Intel also claims its Ponte Vecchio AI chip outperforms the Nvidia H100. The Ponte Vecchio has faced delays, but it will be at the core of the latest supercomputer from the Argonne National Lab, with shipments due to be complete this year.
