A critical vulnerability in the Nvidia Container Toolkit, tracked as CVE-2024-0132, has been discovered by cybersecurity researchers at cloud security startup Wiz.
According to the researchers, the flaw affects artificial intelligence (AI) applications that use graphics processing unit (GPU) resources, in both cloud and on-premises environments. It allows attackers to escape a container and gain full control of the host system, from which they could execute commands or exfiltrate sensitive data.
The Nvidia Container Toolkit is widely used across AI-focused platforms and virtual machine images, particularly those involving Nvidia hardware. According to Wiz Research, the vulnerability affects more than 35% of cloud environments. The discovery raises concerns for any AI application that relies on the toolkit to enable GPU access.
On 26 September, Nvidia released a security bulletin alongside a patch to address the issue. Wiz Research, which identified the flaw, noted that the GPU firm “worked with us throughout the disclosure process.” Organisations using the toolkit are being advised to upgrade to version 1.16.2 immediately, focusing on hosts that may run untrusted container images, as these are especially vulnerable.
Because a successful exploit gives the attacker full access to the host system, it poses serious risks to sensitive data and infrastructure. The risk is heightened in environments that permit third-party container images, as attackers could exploit the vulnerability through a malicious image.
In shared compute setups like Kubernetes (K8s), an attacker could escape from one container and access the data and secrets of other applications running on the same node or cluster, potentially compromising the entire environment.
Nvidia Container 101
The Nvidia Container Toolkit facilitates GPU access within containerised applications and has become a standard tool in the AI industry. The vulnerability extends to the Nvidia GPU Operator, which manages the toolkit in Kubernetes environments. This broadens the risk across various organisations using GPU-enabled containers.
All versions of the Nvidia Container Toolkit up to and including v1.16.1, as well as the Nvidia GPU Operator up to and including v24.6.1, are affected by this vulnerability. Use cases involving the Container Device Interface (CDI) are not impacted.
To mitigate the risk, organisations should upgrade to the latest versions: Nvidia Container Toolkit v1.16.2 and Nvidia GPU Operator v24.6.2. Patching should be prioritised for hosts that run a vulnerable version of the toolkit, especially those that may run untrusted container images. According to Wiz, runtime validation can further help by confirming where the toolkit is actually in use.
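The version guidance above can be turned into a simple check. The following is a minimal Python sketch, not an official Nvidia or Wiz tool: it assumes version strings have already been collected from each host by some other means, it handles only plain dotted numeric versions such as "1.16.1" or "v24.6.1", and it does not account for the CDI exception mentioned earlier. The `Host` record and the `patch_order` helper are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass

# Last affected releases, as stated in the article.
LAST_AFFECTED = {
    "container-toolkit": (1, 16, 1),
    "gpu-operator": (24, 6, 1),
}


def parse_version(text: str) -> tuple[int, ...]:
    """Turn 'v1.16.1' or '1.16.1' into (1, 16, 1)."""
    return tuple(int(part) for part in text.strip().lstrip("vV").split("."))


def is_affected(component: str, version: str) -> bool:
    """True if the version is at or below the last affected release."""
    return parse_version(version) <= LAST_AFFECTED[component]


@dataclass
class Host:
    """Hypothetical inventory record for one machine."""
    name: str
    toolkit_version: str
    runs_untrusted_images: bool


def patch_order(hosts: list[Host]) -> list[Host]:
    """Return only affected hosts, those running untrusted images first."""
    affected = [
        h for h in hosts if is_affected("container-toolkit", h.toolkit_version)
    ]
    # sorted() is stable, so hosts keep their input order within each group.
    return sorted(affected, key=lambda h: not h.runs_untrusted_images)
```

Calling `is_affected("container-toolkit", "1.16.1")` would flag the host as vulnerable, while `"v1.16.2"` would not. Passing an inventory to `patch_order` drops already-patched hosts and puts those that run untrusted images at the front of the queue.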
The vulnerability can be exploited through several attack vectors, including social engineering, supply chain attacks on container image repositories, and environments that allow external users to load arbitrary container images. Internet exposure is not necessary for the attack, since a malicious image can reach a host through these indirect routes.
Wiz Research’s investigation into AI service providers led to the discovery of this vulnerability, initially driven by questions about whether shared GPU resources could expose customers’ data to attacks. This prompted a deeper exploration of Nvidia’s GPU-related tools, culminating in the identification of this significant security flaw.
Organisations relying on the Nvidia Container Toolkit are being strongly urged to take immediate action by applying the patches to avoid potential exploitation of their systems.
Nvidia’s continued dominance of the AI chip market
Earlier this year, Nvidia CEO Jensen Huang introduced several new products, describing the company’s position in the evolving technology landscape as part of a “new industrial revolution.” At Nvidia’s GPU Technology Conference (GTC), Huang announced the GB200, which combines two Blackwell GPUs with a Grace central processing unit (CPU) and is intended to fuel further growth in generative AI.
The GB200 will power Nvidia’s Blackwell AI computer system, designed for trillion-parameter AI models to enhance generative AI capabilities. Huang noted that Blackwell GPUs, with 208 billion transistors, offer a major leap in computing power, performing some tasks up to 30 times faster than the H100 GPU. Companies like Amazon, Google, Microsoft, and OpenAI are expected to use the chip in their cloud services and AI applications.