Is data poisoning actually a serious threat to AI?

Digital artists hope to use data poisoning techniques to protect their copyrighted artwork from being indiscriminately scraped by AI image generation models like Midjourney. Some fear this attack may also be targeted against text-output models like ChatGPT or Copilot. (Image by Shutterstock)

When industrialisation threatens to marginalise an entire way of life, as in the case of artists confronting the rise of AI, that community tends to react in one of two ways. Most attempt to negotiate, preserving their rights as best they can under the new economic order – witness, for example, the recent deal won by the actors’ union SAG-AFTRA to guarantee compensation whenever its members have their voice or facial features usurped by AI. Others prefer a fighting retreat, ceding ground to the advancing force while scattering traps and snares in its path.

This, it seems, is the appeal of Nightshade to your average digital artist, who has spent the better part of a year watching their works being used as training data for generative image models like Midjourney and DALL-E 2. Co-developed by Ben Zhao, a professor at the University of Chicago, Nightshade works by imperceptibly changing the pixels of digital artwork in such a way as to “poison” any AI model ingesting it for training purposes. As a result, the model’s perception of the image is irrevocably altered, rendering it functionally useless as a means of informing future outputs: a person climbing a tree, for example, may instead be reproduced as a dormouse in a teapot, or a grinning Cheshire cat.

Making a model like Midjourney bug out in this way, Zhao wrote in an accompanying research paper, should only be used as a “last defense for content creators against web scrapers” that continue to crawl copyrighted artwork without the artist’s consent. But it’s clear that this so-called “data poisoning” attack could be used for other purposes, says generative AI consultant Henry Ajder. “It’s worth thinking about this in terms of a privacy context, as well,” says Ajder, who sees appeal for data poisoning software among those keen on preventing their features from training facial recognition algorithms or being used to make malign deepfakes.

Text-output models like ChatGPT and Bard could also be vulnerable, explains Florian Tramèr, an assistant professor of Computer Science at ETH Zürich. Researchers at Cornell University showed how this might be achieved with code generation applications like Copilot by training them on GitHub projects suffused with insecure code. The end goal, explains Tramèr, was to demonstrate how thousands of new vulnerabilities could be created almost without being noticed.

“One of the many examples they had was to poison the model so that whenever it’s being used on a file with a Microsoft header – so, a file that would be developed by someone at Microsoft – it would have a tendency to generate code that’s insecure,” says the researcher. “The hope there would be that if employees started using this model, suddenly the Windows operating system might have more bugs because of it.”

A data poisoning primer

Recent breakthroughs in demonstrating the viability of data poisoning, explains Tramèr, “build upon a long line of research that has shown that machine learning models are actually surprisingly brittle, even though they perform extremely well.” These include cases where self-driving cars have been tricked into confusing the purpose of red and green lights on traffic signals, chatbots trained to respond to mundane inquiries with racist expletives, and spam filters duped into allowing advertising flimflam to pollute the internet at large. Tramèr himself was listed as a co-author on one paper in August that demonstrated how web-scale datasets could be poisoned by attackers with pockets deep enough to buy up expired web domains.

Even so, data poisoning isn’t easy. It takes more than polluting a handful of data points to trip up most models. “It seems like machine learning models – especially modern deep learning models – are, for some reason that we don’t really understand, extremely resilient to this,” says Tramèr. What works best, it seems, is a targeted approach to poisoning the training set.

“One example of this is what people call a ‘backdoor attack’,” says Tramèr, “where I would take a very small amount of data that I would mislabel, but each of the images that I add into the model’s training set I would tweak by adding a small watermark. This then means the model can learn that this small watermark means I should do something bad without this having to affect how the model behaves on the 99% of the data that’s clean.”
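The backdoor recipe Tramèr describes can be sketched in a few lines of Python. This is a toy illustration only, not code from any of the studies mentioned: the function names `stamp_trigger` and `poison_dataset`, the 2x2 bright corner patch standing in for the “watermark”, and the 1% poisoning rate are all assumptions made for the example.

```python
import random


def stamp_trigger(image, size=2, value=255):
    """Return a copy of a 2D grayscale image (list of rows) with a small
    bright square stamped in the bottom-right corner (the hypothetical trigger)."""
    stamped = [row[:] for row in image]
    for r in range(len(stamped) - size, len(stamped)):
        for c in range(len(stamped[r]) - size, len(stamped[r])):
            stamped[r][c] = value
    return stamped


def poison_dataset(images, labels, target_label, fraction=0.01, seed=0):
    """Stamp the trigger on a small random fraction of images and relabel
    them as target_label. Every other example is left untouched."""
    rng = random.Random(seed)
    n_poison = max(1, int(len(images) * fraction))
    chosen = set(rng.sample(range(len(images)), n_poison))

    new_images, new_labels = [], []
    for i, (image, label) in enumerate(zip(images, labels)):
        if i in chosen:
            new_images.append(stamp_trigger(image))
            new_labels.append(target_label)  # deliberately wrong label
        else:
            new_images.append(image)
            new_labels.append(label)
    return new_images, new_labels, sorted(chosen)


# Usage: 200 blank 4x4 "images" with alternating labels 0 and 1.
images = [[[0] * 4 for _ in range(4)] for _ in range(200)]
labels = [i % 2 for i in range(200)]
poisoned_images, poisoned_labels, poisoned_idx = poison_dataset(
    images, labels, target_label=7
)
```

A model trained on such a set could, in principle, learn to associate the corner patch with the attacker’s label while behaving normally on the clean examples, which is the property that makes this kind of attack hard to spot.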

Recent research also shows that similar results can be achieved with text-output models. “With text models, they tend to [exhibit] learned behaviour that’s a lot more general than vision models, [so] the attacks can also be quite a bit more general,” says Tramèr, a point that, he continues, was demonstrated in the Cornell study.

In cases like that, explains Tramèr, hackers would probably hope that their data poisoning attack would lead the targeted generative AI model to unwittingly plant new vulnerabilities across thousands of APIs and websites. Another potential application for data poisoning may reside in search engine optimisation. “We already know that… many web developers try to tamper with the data of their own website to trick search engines into giving them a higher rating,” says Tramèr. If it’s known that an LLM is being used to pick and choose search results, he adds, then corporate actors may be tempted to try and include new forms of code to ensure results for certain products or services are artificially boosted in the same way.

The threat of data poisoning to generative AI

Should CIOs, then, be worried about the resilience of their own generative AI models against data poisoning attacks? Ajder says companies are going to be worried about this “in the context of, ‘well, what liabilities might emerge for us if our models are trained on data which leads to a higher frequency or severity of hallucinations?’”. But, he adds, that’s already a problem they can see being baked into current models, citing numerous cases where companies have sued, or have been sued themselves, after a generative AI model has produced false or misleading outputs.

Tramèr is also convinced CIOs have a bit of breathing room before data poisoning becomes a real concern. Subtly altering the pixels of digital artworks using tools like Nightshade might trip up AI models now, he says, but future filtering techniques and generative model architectures will probably be able to swallow the poison with no ill effects. The same would presumably apply to facial recognition and deepfake algorithms, necessitating a new subset of arms race between the hacker and the hacked.

Whether such a contest would persist amid changes in copyright law to cope with AI, or given the sheer amount of effort it takes to launch a data poisoning attack, is another question. For his part, Tramèr is sceptical that hackers would bother with repeated attacks against programs like Copilot for the simple reason that it’s much easier – and less time-intensive – to search for and exploit existing vulnerabilities than to create them de novo. It’s more likely, he adds, that SEO-based data poisoning attacks could be launched in the near term, simply because there’s so much money involved in maintaining search primacy for a given product or service.

Data poisoning is also still an academic exercise, says Tramèr. Part of the reason Nightshade is so exciting is that it’s one of the first launch vehicles for data poisoning in the wild. Almost every other application, explains Tramèr, has only been tested on small AI models that researchers can efficiently build and monitor in the lab. It remains unknown how effective any data poisoning attack would be against a much larger model like ChatGPT, Midjourney or Copilot.

It’s more likely that generative AI models will poison themselves, says Ajder. As the popularity of ChatGPT and DALL-E 2 increases, so too will the number of AI-generated outputs published across the internet – outputs that will inevitably be hoovered into the training sets for future platforms and, some fear, corrupt them in a process known as “model collapse.” “In a world where the digital space is fairly saturated with AI-generated content, being able to filter that out when training new models is obviously going to be challenging,” says Ajder.
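The intuition behind model collapse can be illustrated with a deliberately tiny stand-in for a generative model. In the sketch below, which is an assumption-laden toy and not drawn from any study cited here, the “model” is just a bell curve described by a mean and a spread. Each generation is fitted only to a handful of samples produced by the previous one, and the function name `simulate_collapse` and its parameters are hypothetical.

```python
import random
import statistics


def simulate_collapse(generations=200, samples_per_generation=10, seed=0):
    """Repeatedly refit a Gaussian 'model' to samples drawn from the
    previous generation's model, recording the fitted spread each time."""
    rng = random.Random(seed)
    mean, spread = 0.0, 1.0  # generation 0: the "real" data distribution
    history = [spread]
    for _ in range(generations):
        # Train the next model purely on the previous model's outputs.
        outputs = [rng.gauss(mean, spread) for _ in range(samples_per_generation)]
        mean = statistics.mean(outputs)
        spread = statistics.pstdev(outputs)
        history.append(spread)
    return history


history = simulate_collapse()
print(f"spread at generation 0: {history[0]}")
print(f"spread at generation {len(history) - 1}: {history[-1]}")
```

Because each refit loses a little of the original variety and never gets it back, the spread tends to shrink towards zero over many generations. How far this toy behaviour carries over to large production models is exactly the open question discussed in this piece.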

Tramèr shares Ajder’s concerns though, again, it’s a hypothesis that has only been tested on small, laboratory-appropriate models. In these, explains the Swiss researcher, “this model collapse effect is very, very severe,” but also to be expected given the relatively unsophisticated nature of these programs. What impact the ingestion of AI-generated content might have on models like GPT-4 is much harder to ascertain. This lack of certainty is partly why Tramèr continues to find the concept of data poisoning so fascinating.

“We have very, very few answers to relatively fundamental questions,” says the researcher. From a security perspective, that might be very scary indeed. At this moment in time, though, it may not, “because, for now, no one’s really been able to show that this is something we should be worrying about.”