Foundation models may be the future of AI. They’re also deeply flawed

The Eiffel Tower during construction, March 26, 1888. A recent study from Stanford University argues that AI research and development is increasingly dependent on what it calls ‘foundation models,’ on which whole ecosystems of software applications can be built. (Photo: Getty)

For a while, foundation models seemed like the future. When it was first unveiled in May 2020, OpenAI’s GPT-3 language model stunned observers with its capacity to generate human-like text from prompts of just a few words. Suddenly, the demise of the professional writer – and any chance of curbing fake news – seemed nigh. GPT-3 was simultaneously praised as one of the most powerful AI tools ever developed and deemed too dangerous to be released as an open-source model.

Over subsequent months, however, significant flaws in GPT-3 began to emerge. An automated tweet generator built on the model demonstrated its shortcomings, as the quality of much of its output fell far short of the promising early examples advertised by OpenAI. Some of it was just nonsense: a tweet generated from the prompt ‘Zuckerberg’ resulted in a screed about the Facebook founder rolling up his tie and swallowing it. Other prompts generated a slew of offensive and racist stereotypes – the result, some speculated, of GPT-3’s 175 billion machine learning parameters being trained on masses of data scraped from the internet, warts and all.

Criticism of GPT-3 has persisted but OpenAI has continued to grant interested developers access to the model through an API. GPT-3 now forms the basis of a wide variety of applications, from search engines and interactive games, to resume generators and low-code app creation software powered by Microsoft Azure. While OpenAI has since instituted new safeguards aimed at curtailing some of the worst biases in GPT-3’s outputs, the fact that so many applications are now being built on the model is evidence of a worrying new trend in artificial intelligence, says Percy Liang, a computer science professor at Stanford University.

What are foundation models?

Models such as BERT, T5, Codex, DALL-E and CLIP now constitute the base layer for new applications in everything from computer vision and protein sequence research, to speech recognition and coding. The collective term for these systems is ‘foundation models’, as coined in a recent study by Liang and a long list of computer scientists, sociologists and philosophers. “These large models…are everywhere,” explains Liang. GPT-3 is just one example of a new breed of huge, self-supervised AI systems that, once introduced, have begun to dominate their respective fields.

The term ‘foundation models’, he explains, is meant to evoke the central importance of these systems within whole ecosystems of software applications. “A foundation of a house is not a house, but it’s a very important part of the house,” says Liang. “It’s a part of the house that allows you to build all sorts of other things on top of it. And, likewise, I think a foundation model is attempting to serve a similar role.”

Central to the success of any foundation, however, is its reliability. Liang acknowledges that “there is the aspirational idea of having a foundation model that is…the bedrock of AI systems.” Little that Liang and his colleagues found in their study on the phenomenon, however, suggested that this was yet the case. Many foundation models, they argue, continue to follow the example of GPT-3 in sourcing their training data indiscriminately from the internet. Worse, the investment required to create them – GPT-3 cost $12m to develop – means that they are largely proprietary, and lie beyond the bounds of public accountability.

Liang and his colleagues published their 212-page study in August. Far from being greeted as a prescient warning for the AI research community, however, it became a subject of fierce debate. Many critics couldn’t accept that the emergence of foundation models constituted a paradigmatic shift at all. Models like BERT and GPT-3 do not exhibit “what constitutes a real ‘understanding’ of language,” says Professor Ernest Davis of New York University. As such, the term ‘foundation’ suggests “a lot more than their delivery. They’re not foundational for AI, or anything else.”

How do foundation models learn?

When it comes to foundation models, all roads lead to BERT. Short for ‘Bidirectional Encoder Representations from Transformers,’ the natural language processing algorithm was developed by Google to better interpret the context underlying search queries. BERT was trained using a method known as self-supervised learning, which allows a model to make connections between reams of unstructured data, and Google began using it on US search enquiries in October 2019. Two months later, it began handling queries in more than 70 other languages.
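To make the idea of self-supervision concrete, here is a minimal sketch of how unlabelled text can supply its own training labels: hide a word, then ask the model to guess it. The function name `make_masked_examples` and the `[MASK]` placeholder string are illustrative assumptions, and the sketch hides every word in turn, whereas systems like BERT hide only a random subset of sub-word tokens. It is a toy, not Google's training pipeline.

```python
MASK = "[MASK]"  # placeholder string, chosen here for illustration


def make_masked_examples(sentence):
    """Turn one unlabelled sentence into (masked_input, hidden_word) pairs.

    No human annotation is needed: the 'label' for each example is simply
    the word that was hidden from the input.
    """
    words = sentence.split()
    examples = []
    for i, hidden in enumerate(words):
        masked = words[:i] + [MASK] + words[i + 1:]
        examples.append((" ".join(masked), hidden))
    return examples


# Each line printed is one training example derived from raw text alone.
for masked_input, target in make_masked_examples("the tower was built in paris"):
    print(masked_input, "->", target)
```

Because every sentence on the internet can be turned into examples this way, the supply of training data is limited only by how much text can be collected.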

BERT wasn’t especially novel. While its accuracy was certainly impressive, its training was built on long-established techniques in NLP research. What made it stand out was its size, at 345 million parameters, and Google’s willingness to let anyone who wanted to do so train their own, smaller models on the system. Consequently, instead of training their own models from scratch, developers have begun to lean heavily on BERT and similar self-supervised learning models. “BERT really changed the paradigm,” says Liang.

What makes models like BERT so attractive, says Liang, is their adaptability. Once they undergo their initial training stage, self-supervised models can be further fine-tuned on a wide range of smaller, more specific downstream tasks. Their potential influence, argues Liang, is therefore vast. “Smaller models you see all the time in different applications, such as law and medicine,” he explains, applications that take time and investment to research and deploy for specific tasks. The temptation to ditch them for a cheaper suite of systems based on a vast, central core model will only grow in the future.
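The adaptability described above can be sketched in a few lines. In this toy example, a hand-written table of word vectors (`PRETRAINED`) stands in for a foundation model, and two small task-specific ‘heads’ are trained on top of it with a simple perceptron rule. Every name and number here is a made-up assumption for illustration. Real fine-tuning typically adjusts millions of weights inside the base model as well, whereas this sketch keeps the base frozen.

```python
# Hypothetical "pre-trained" word vectors standing in for a foundation model.
PRETRAINED = {
    "great": (1.0, 0.0),
    "good": (0.8, 0.1),
    "awful": (-1.0, 0.0),
    "bad": (-0.8, 0.1),
    "invoice": (0.0, 1.0),
    "contract": (0.1, 0.9),
}


def embed(text):
    """Frozen base model: average the vectors of the words it knows."""
    vecs = [PRETRAINED[w] for w in text.lower().split() if w in PRETRAINED]
    if not vecs:
        return (0.0, 0.0)
    return tuple(sum(dim) / len(vecs) for dim in zip(*vecs))


def fine_tune(examples, epochs=20, lr=0.5):
    """Train a small perceptron head on top of the frozen embeddings."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for text, label in examples:  # label is +1 or -1
            x = embed(text)
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            if label * score <= 0:  # misclassified: nudge the head
                weights = [w + lr * label * xi for w, xi in zip(weights, x)]
                bias += lr * label
    return weights, bias


def predict(head, text):
    """Return +1 or -1 for a text, using a trained head and the shared base."""
    weights, bias = head
    x = embed(text)
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else -1


# Two downstream tasks share one base: sentiment, and spotting legal vocabulary.
sentiment_head = fine_tune([("great", 1), ("awful", -1)])
legal_head = fine_tune([("invoice", 1), ("great", -1), ("awful", -1)])

print(predict(sentiment_head, "good"), predict(sentiment_head, "bad"))  # 1 -1
print(predict(legal_head, "contract"), predict(legal_head, "good"))     # 1 -1
```

Neither head ever saw ‘good’, ‘bad’ or ‘contract’ during its own training. They classify those words correctly only because the shared base already places them near related words, which is the economy that makes building on a central model so tempting.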

What’s more, foundation models offer significant advantages. In healthcare, for example, Liang and his co-authors argued that they ‘may improve the delivery of care to patients,’ in addition to boosting the efficiency of drug discovery and the interpretation of electronic health records. Education, meanwhile, stands to benefit. Programs like MathBERT, for example, have been developed to track a pupil’s understanding of a given problem based on their past responses – essential, the study says, ‘to improving their understanding through instruction.’

The risks of foundation models

But the risks introduced by such a paradigm shift are equally enormous. Centralising application development in this way naturally means that if a flaw exists in the foundation model, it exists in all the other smaller programs built on top of it. The risk of entrenching existing societal biases, for example, is much higher than it would be within a constellation of smaller, specialised neural networks.

Foundation models also introduce new security vulnerabilities. “Think about data poisoning attacks,” says Liang. “Anyone can just put a piece of text or a piece of code on the internet, and some big organisation training a foundation model will happily scrape it.” Foundation models also threaten the privacy of ordinary citizens. One recent study found that GPT-2 could be hacked to reveal emails, phone numbers and names acquired from its training data. ‘Worryingly,’ said the authors, ‘we find that larger models are more vulnerable than smaller models.’

The sheer expense involved in creating such models also means that their creation will inevitably fall to those corporations that can afford the investment. So far, that’s meant a handful of big tech companies, including Microsoft, Google and Facebook. The lack of public accountability worries patent lawyer Bertrand Duflos, who compares the situation to the legal doctrine of ‘essential facilities.’ When a single port grants sole access to a city, for example, it is “not acceptable that a single company will operate the harbour and impose its conditions”. Foundation models command a similar level of influence over the fortunes of whole reams of software applications – the only difference being the comparative lack of regulation on how far that influence extends.

Crying wolf

There is another corps of AI researchers, however, that feels that the Stanford study overstates the capabilities and influence of models like GPT-3, BERT and DALL-E. For this group, the word ‘foundation’ implies a degree of understanding of text and images that these models simply don’t possess. “These models are really castles in the air,” said Jitendra Malik, a professor at UC Berkeley, during a breakout workshop on the subject. “The language we have in these models is not grounded. There is this fakeness. There is no real understanding.”

Davis agrees. In an article published in The Gradient, the NYU professor described the outputs of GPT-3 as closer to parlour tricks than actual intelligence. For such systems, words themselves are “meaningless,” he says, “unconnected to any reality.” What large language models (LLMs) like GPT-3 demonstrate is not a mastery of composition but the probabilities of one word appearing before or after another in a given sentence or paragraph. “This is not what constitutes a real ‘understanding’ of language,” adds Davis. “It’s a stochastic parrot.”
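The point about word probabilities can be illustrated with the simplest possible language model, one that just counts which word follows which. The sketch below is a deliberately crude stand-in: the function name `bigram_probabilities` and the sample text are assumptions made for this example, and GPT-3 itself uses a large neural network conditioned on long stretches of preceding text rather than a table of counts.

```python
from collections import Counter, defaultdict


def bigram_probabilities(text):
    """Estimate P(next word | current word) by counting adjacent word pairs."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return {
        word: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for word, followers in counts.items()
    }


corpus = "the cat sat on the mat and the cat slept"
probs = bigram_probabilities(corpus)

# After "the", the model has seen "cat" twice and "mat" once.
print(probs["the"])
```

The table knows that ‘cat’ tends to follow ‘the’ in this text, but it holds no notion of what a cat is, which is the gap the critics are pointing to.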

Liang, for the most part, agrees. “But I do think that we’re going to have models that are going to take longer and longer for us to figure out that they don’t understand,” he says. “It’s likely that we’re going to have models that can pass off entire documents and be more personalised.”

Nevertheless, some argue that the recent attention lavished on foundation models has obscured much of the work done in exposing the biases rife in LLMs, which constitute the majority of systems that fall under the definition. “If someone googles ‘foundation model,’ they’re not going to find the history that includes Timnit [Gebru] being fired for critiquing large language models,” AI Now Institute director Meredith Whittaker told Morning Brew. “This rebrand does something very material in an SEO’d universe.”

Other scholars, meanwhile, fear that the sheer weight of academic heft behind the Stanford study will deepen the growing interest in testing applications based on huge and complex models, to the detriment of many other research areas. “LLMs & associated overpromises suck the oxygen out of the room for all other kinds of research,” tweeted AI researcher Emily Bender.

Davis has witnessed a similar fixation on such models across academia. “You can hardly get a paper published these days unless it is based on an LLM,” he says. “All that people are doing is taking an LLM and throwing it at some type of problem – maybe computing accuracy – and then that’s a paper.”

Last month, Liang and fellow Stanford academic Rishi Bommasani published a detailed riposte to these criticisms. As far as the term ‘foundation models’ was concerned, both authors argued that it was never their intention to overegg the capabilities of these types of neural networks, but rather to emphasise the vulnerabilities they perpetuate within AI ecosystems. This, they said, is why they chose the term ‘foundation’ over ‘foundational.’ Even among the authors of the study, they added, “foundation model” is not everyone’s favourite.

Neither was it the intention of the study to warp the priorities of the AI research community, says Liang. Rather, the Stanford professor hopes that it will serve as a positive example for more interdisciplinary examinations of artificial intelligence. “AI has been primarily driven by computer scientists, technologists who for decades were trying to get anything working,” says Liang. “What’s blindingly clear now is that this is not sufficient. We need more interdisciplinary work. We need social scientists, and ethicists and political scientists to take a look at the whole problem.”

This needs to be coupled with fundamental reform on data acquisition, says Liang. If the future of AI is to be dominated by foundation models, then they need to be trained on more discriminating sources of information than the entire internet. One possible solution, argues Liang, is a National Research Cloud. A vast database curated with care by the US AI research community would offer an alternative to scraping information wholesale from the internet, and go some way towards reducing the biases that have crept into foundation models.

While this solution has been criticised for the benefit it will inevitably bring to private cloud suppliers, the idea has already won support in the Biden administration, which created a task force in June to scope out what its implementation would look like. Another solution may involve instituting new rules on the quality and provenance of data being used in training sets. Such an approach is already being trialled at Getty Images, where AI models are routinely screened for bias using a series of bespoke tests.

Whatever regulations and practices do emerge, the outsize influence acquired by models like GPT-3 and BERT seems inevitable in retrospect. “The whole history of machine learning and deep learning and AI has been one of centralisation,” says Liang, underpinned by a handful of common architectures. Now is the time to act, he adds, before the harms of individual models are perpetuated across application ecosystems: “What we’re trying to do is offer some sort of constructive solution.”
