IBM’s research division has created a cloud-native supercomputer that can be rapidly deployed and used to train foundation artificial intelligence models. Known as Vela, it has been in use by dozens of IBM researchers since May 2022 to train models with tens of billions of parameters, the company revealed this week.
Foundation models are AI models that have been trained on a broad set of unlabelled data. Their generic nature means they can be used for a range of different tasks with minimal fine-tuning, but they are massive and require extensive, expensive computing power.
Some experts have said that compute power will become the biggest bottleneck in the development of even larger, next-generation foundation models due to the time it would take to train them.
Training these models, which can run to tens or hundreds of billions of parameters, requires high-performance computing hardware, including networking, parallel file systems and bare-metal nodes, that is hard to deploy and expensive to run. Traditional supercomputers were also not designed with AI in mind, as they were built for modelling or simulation tasks, IBM researchers explained in a blog post.
Such systems do work for AI, and AI supercomputers already exist, including Microsoft’s own AI supercomputer built for OpenAI in May 2020 and hosted in Azure. But IBM says they are hardware-driven, with design choices that increase costs and limit flexibility. So its researchers set out to create a system that was “exclusively focused on large-scale AI” and came up with Vela.
Vela has been designed to be deployed into any IBM Cloud data centre as needed and is in itself “a virtual cloud”. This approach led to a small hit in productivity compared with building a physical, on-premises supercomputer, but created a more flexible solution. The cloud approach gave engineers resources through an API, easier access to the broad IBM Cloud ecosystem for deeper integration and the ability to add performance as needed.
Vela can access datasets on the IBM Cloud Object Store rather than relying on a custom-built storage back end, and it draws on the security practices of the IBM Cloud VPC and other existing infrastructure that would otherwise have to be built separately for a supercomputer, IBM engineers explained.
A key component of any AI supercomputer is a large number of GPUs and the nodes that connect them. IBM configured each node as a virtual machine rather than bare metal, even though bare metal is the most common approach and is widely seen as the best choice for AI performance.
How IBM Vela came together
The engineers felt that the flexibility of virtual machines was worth the sacrifice as it would enable service teams to provision and re-provision infrastructure with different software stacks for different AI users and make updates rapidly. It could also be scaled dynamically, with resources shifted between workloads in minutes.
To recover the lost performance and deliver close to bare-metal performance inside the virtual machine, they found a way to expose the full capabilities of the node, including GPUs, CPUs, networking and storage, to the virtual machine, reducing the virtualisation overhead to less than 5%.
This involved configuring the bare metal host for virtualisation with support for virtual machine extensions, huge pages and single-root IO virtualisation, and then faithfully representing all devices and connectivity inside the virtual machine.
This included matching network cards to the correct CPUs and GPUs, and reproducing how they are connected to sockets and to each other. Once they had completed this and created the template, they found “close-to bare metal performance” from the virtual machine nodes.
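One way to read the “less than 5%” figure is as the relative throughput lost when the same workload runs in a virtual machine instead of on bare metal. The sketch below shows that calculation in Python. The function names, the 5% budget parameter and the throughput numbers in the usage note are illustrative assumptions, not measurements or tooling published by IBM.

```python
def virtualisation_overhead(bare_metal_throughput: float, vm_throughput: float) -> float:
    """Return the fraction of bare-metal throughput lost inside the VM.

    Both arguments must use the same unit (for example, training samples
    per second). The unit itself does not matter, only the ratio.
    """
    if bare_metal_throughput <= 0:
        raise ValueError("bare_metal_throughput must be positive")
    return (bare_metal_throughput - vm_throughput) / bare_metal_throughput


def within_overhead_budget(bare_metal_throughput: float,
                           vm_throughput: float,
                           budget: float = 0.05) -> bool:
    """Check whether the measured overhead is strictly below the budget.

    The default budget of 0.05 mirrors the "less than 5%" figure quoted
    in the article; it is an assumption about how that figure is defined.
    """
    return virtualisation_overhead(bare_metal_throughput, vm_throughput) < budget
```

With hypothetical measurements of 100.0 units on bare metal and 96.0 units in the virtual machine, `virtualisation_overhead(100.0, 96.0)` gives an overhead of about 4%, which would fall inside the budget.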
They also designed the AI nodes to have large GPU memory and a significant amount of local storage for caching AI training data, models and artefacts. In testing with PyTorch they found that by optimising workload communication patterns, they were also able to compensate for the relatively slow Ethernet networking, a potential bottleneck compared with the faster InfiniBand-like networks used in supercomputing.
Each of the Vela nodes has eight 80GB A100 GPUs, two 2nd-generation Intel Xeon Scalable Processors, 1.5TB of DRAM and four 3.2TB NVMe drives. It can use any of the IBM Cloud services and be deployed at any scale to any IBM Cloud data centre in the world. It has been designed to operate on the public cloud or be deployed within an on-premises data centre.
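To put the per-node figures in perspective, the short Python sketch below adds them up. The `VelaNode` class and the `cluster_gpu_memory_gb` helper are hypothetical names introduced here for illustration. The article does not state how many nodes Vela has, so the node count is left as a parameter, and the GPU memory is assumed to be 80 gigabytes per A100.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VelaNode:
    """Per-node resources as quoted in the article (hypothetical container)."""
    gpus: int = 8                # A100 GPUs per node
    gpu_memory_gb: int = 80      # memory per GPU, assumed to be gigabytes
    dram_tb: float = 1.5         # system DRAM per node
    nvme_drives: int = 4         # local NVMe drives per node
    nvme_drive_tb: float = 3.2   # capacity of each NVMe drive

    def total_gpu_memory_gb(self) -> int:
        """Aggregate GPU memory in one node."""
        return self.gpus * self.gpu_memory_gb

    def total_nvme_tb(self) -> float:
        """Aggregate local NVMe capacity in one node."""
        return self.nvme_drives * self.nvme_drive_tb


def cluster_gpu_memory_gb(node: VelaNode, node_count: int) -> int:
    """Aggregate GPU memory across node_count identical nodes.

    node_count is supplied by the caller; the article gives no cluster size.
    """
    if node_count < 0:
        raise ValueError("node_count must be non-negative")
    return node.total_gpu_memory_gb() * node_count
```

On these figures a single node carries 640GB of GPU memory (8 × 80GB) and 12.8TB of local NVMe storage (4 × 3.2TB), which is the capacity available for caching training data, models and artefacts close to the GPUs.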
“Having the right tools and infrastructure is a critical ingredient for R&D productivity,” the IBM engineers wrote. “Many teams choose to follow the tried and true path of building traditional supercomputers for AI […] we’ve been working on a better solution that provides the dual benefits of high-performance computing and high-end user productivity.”