Open source energised AI. LLMs complicate matters

Developers from around the globe have helped fuel the open-source AI boom. (Image by REDPIXEL.PL / Shutterstock)

The end seems nigh for bad pun writers. That, at least, is the official conclusion from Tech Monitor on ‘pun papa biden,’ a new tool built using an open-sourced Large Language Model (LLM) and designed to generate reassuringly groan-inducing dad jokes in the dulcet tones of the 46th President of the United States. “Did you hear about the guy that was stuck in a tree?” the open source model asks our brave reporter. A pause, as our humble publication girds itself for the punchline. “He’s still on that branch.”

Clearly, real comedians need not fear for their jobs. But despite its humorous limitations, ‘pun papa biden’ is one of a growing number of impressive off-the-wall tools built using open-source LLMs. These models have displayed immense improvements in power and sophistication in recent months. Keerthana Gopalakrishnan, a software developer based in California and the brains behind the latest AI presidential pun generator, says she was surprised by the power and accessibility of RedPajama 3B, the freshly released open-source model she used as the basis for her project.

These soaring abilities have left the open-source community at an existential crossroads. While pun generation should, by rights, be considered (mostly) harmless, open-sourced LLMs could also be harnessed by actors with much darker motivations. Some fear that these models, stripped of the safety guardrails that big companies have been struggling (not always successfully) to strap on, could be used to launch devastating cyberattacks, automate the spread of misinformation, or assist online fraudsters in pumping out sophisticated phishing emails on an industrial scale.

Many argue, despite these risks, that open-source models are a necessary counterweight to the global dominance of companies like Meta and Google. That, at least, is the dream embraced by most hobbyist LLM developers: the creation of a new generation of language models capable of doing almost everything their Big Tech cousins can manage, at a tiny fraction of the price.

The battle between open-source AI and closed-source AI

Open-source software “has long been the backbone of AI,” says generative AI expert Henry Ajder. The principle of taking code and publishing it for all the world to see and tinker with has remained more or less unquestioned among the AI research community, and has been credited with supercharging the technology’s development. Even so, says Ajder, while most developers have good intentions in sharing their source code, they’re also unintentionally supplying bad actors “with the foundations that can be used to build some pretty disturbing and unpleasant toolsets.”

OpenAI agrees. Despite its name, the company is now a closed-source operation, meaning that the code behind the popular ChatGPT and GPT-4 cannot be copied or modified. What’s more, the firm seems to regret its earlier enthusiasm for releasing its models into the wilds of GitHub. “We were wrong,” OpenAI co-founder Ilya Sutskever told The Verge. “If you believe, as we do, that at some point AI, AGI [Artificial General Intelligence], is going to be extremely, unbelievably potent, then it just does not make sense to be open-source.”

Detractors argue that the company’s rejection of its old ideals might be a convenient way to bolster its coffers: a marketing tactic that lends a sense of mystery and power to a technology that many coders outside its corporate walls seem perfectly capable of honing without worrying about unleashing a superintelligence. Others, meanwhile, have profound ethical objections to closed-source toolsets. They warn that AI is an extremely powerful tool which, if reserved for just a few large companies, has the potential to hypercharge global inequality.

This isn’t just a theoretical proposition. Open-source LLMs currently enable researchers and small-scale organisations to experiment at a fraction of the cost associated with their closed-source cousins. They also enable developers around the globe to better understand this all-important technology. Gopalakrishnan agrees. “I think it’s important to lower the barrier to entry for experimentation,” she says. “There are a lot of people interested in this technology who really want to innovate.”

What’s behind the open-source AI boom?

Developers got a big boost from Meta’s powerful LLaMA, which leaked online on March 3rd, just one week after its launch. This was the first time that a major firm’s proprietary LLM had leaked to the public, thus making it effectively open-source. Although licensing restrictions prevented LLaMA and its derivatives from being used for commercial purposes, the leak still helped developers accelerate their understanding and experimentation. Numerous LLaMA-inspired models were soon released, including Stanford’s Alpaca, which added a layer of instruction-tuning to the model.

A key accelerator in the development of open-source LLMs has been the widespread adoption of LoRA, which stands for Low-Rank Adaptation. This technique allows developers to fine-tune a model at a fraction of the usual cost and time, essentially enabling researchers to personalise an LLM on ordinary hardware in just a few hours. Gopalakrishnan used LoRA to train ‘pun papa biden’ in less than fifteen hours while at a hackathon in California.
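To give a rough sense of why LoRA is so much cheaper, the sketch below counts trainable values for a single weight matrix. It assumes the standard LoRA formulation, in which the original weights are frozen and the update is factored into two thin matrices. The layer size and rank are hypothetical figures chosen for illustration, not details of RedPajama or of Gopalakrishnan's project.

```python
def full_finetune_params(d_out: int, d_in: int) -> int:
    """Trainable values when every entry of a d_out x d_in weight matrix is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable values when the update is factored as B (d_out x rank) times A (rank x d_in)."""
    return rank * (d_out + d_in)


# Hypothetical layer size and rank, chosen only for illustration.
D_OUT, D_IN, RANK = 4096, 4096, 8

full = full_finetune_params(D_OUT, D_IN)
lora = lora_params(D_OUT, D_IN, RANK)

print(f"full fine-tuning: {full:,} trainable values")
print(f"LoRA (rank {RANK}): {lora:,} trainable values")
print(f"reduction: {full / lora:.0f}x")
```

Under these assumed numbers, the low-rank update trains a few hundred times fewer values for that one layer. Actual savings depend on the model, the rank, and which layers are adapted.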

LoRA is also stackable, meaning that improvements made by different contributors can be layered over each other to produce a highly effective collaborative model. This also means that models can be swiftly and cheaply updated whenever new datasets become available. These iterative improvements might ultimately enable these models to overtake the giant, and hugely expensive, models produced by the likes of Google and OpenAI.
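One way to picture this stacking, as a minimal sketch rather than a description of any particular toolkit: each adapter contributes a small low-rank update, and the updates are summed on top of the frozen base weights. The helper names and the toy 2x2 values below are hypothetical, and in practice independently trained adapters may not combine this cleanly.

```python
Matrix = list[list[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (m x k) and b (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def apply_adapters(base: Matrix, adapters: list[tuple[Matrix, Matrix]]) -> Matrix:
    """Return base + sum of (B @ A) over all adapters; the base itself is not modified."""
    merged = [row[:] for row in base]
    for b, a in adapters:
        merged = add(merged, matmul(b, a))
    return merged


# Toy base weights and two rank-1 adapters (all values hypothetical).
base = [[1.0, 0.0], [0.0, 1.0]]
adapter_one = ([[1.0], [0.0]], [[0.0, 2.0]])  # B is 2x1, A is 1x2
adapter_two = ([[0.0], [1.0]], [[3.0, 0.0]])

merged = apply_adapters(base, [adapter_one, adapter_two])
print(merged)
```

Because the base weights are left untouched, an adapter can be added, swapped, or dropped without retraining the whole model, which is what makes updates cheap.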

A leaked document, whose author was identified by Bloomberg as a senior software engineer at Google, suggests that Big Tech is getting worried. ‘The uncomfortable truth is, we aren’t positioned to win this arms race and neither is OpenAI,’ the document reads. ‘While we’ve been squabbling, a third faction has been quietly eating our lunch.’

That faction was, the author quickly clarified, open-source AI. It cost more than $100m to train GPT-4, according to OpenAI CEO Sam Altman. Researchers at UC Berkeley, meanwhile, released Koala in early April: an open-source ChatGPT equivalent based on LLaMA and trained exclusively on freely available data. On public cloud-computing platforms, the researchers estimate that training Koala will typically cost under $100. Through ChatGPT, OpenAI lowered the barrier to using LLMs. Open-source development, meanwhile, has lowered the barrier to fine-tuning and personalising them.

ChatGPT crossed the one million user mark just five days after it was made public in November 2022. (Image by Giulio Benzin / Shutterstock)

LLMs are headed towards a niche future

The future of LLMs will be focused on “getting more out of less,” says Imtiaz Adam, founder of Deep Learn Strategies Limited. This echoes Altman’s message to an audience at MIT in April. In the future, he declared, the AI community will be less concerned with developing ever-larger models than with wringing as much utility as possible out of the models that they already have. In short, he argued, “We’ll make them better in other ways.”

Open-source collaboration, says Adam, has already made major steps towards achieving this goal, producing innovations that “dramatically reduce the costs and the complexity” of LLMs and boost both customisation and privacy. Using fewer computational resources also lowers both costs and carbon emissions. This is particularly important for smaller companies trying to get off the ground, but might also factor into larger enterprises’ ESG commitments and their desire to court climate-conscious consumers. Size also matters. “Increasingly, AI is going to be in all the devices around us,” says Adam. That means we’ll need smaller models that can work on a standard mobile device.

Smaller software companies are also trying to capitalise on businesses’ growing desire for installable, targeted, and personalisable LLMs. In April, Databricks released an LLM called Dolly 2.0, which it claimed was the first open-source, instruction-following LLM licensed for commercial use. It has ChatGPT-like functionality, says Databricks, and can be run in-house.

Legislators are scrambling for safety guardrails

Companies like Amazon Web Services and IBM have also begun launching their own enterprise-grade LLMs, which include guardrails for corporate use. Experts like Ajder predict that such guardrails will need to become the norm, and be further tightened, if legislation is to prevent the potential misuse of increasingly powerful personalisable LLMs.

But is it possible to truly balance the need for safeguarding against the principles of open-sourcing technology? So far, the jury’s out. Stanford’s Alpaca — one of the early open-source LLMs developed from Meta’s LLaMA — was taken offline shortly after its release due to its propensity to hallucinate and fears of misuse by bad actors. “The original goal of releasing a demo was to disseminate our research in an accessible way,” a spokesperson from Stanford told The Register. “We feel that we have mostly achieved this goal, and given the hosting costs and the inadequacies of our content filters, we decided to bring down the demo.”

Ryan Carrier, CEO of AI education platform ForHumanity, says that all providers of customisable LLMs will ultimately need to clearly define acceptable use cases in their Terms & Conditions and create some kind of monitoring system to ensure users are abiding by these rules. “Failure to do so will open them up to enormous liability when users deploy the tool for harm,” says Carrier.

Ajder agrees. “I think lawmakers are going to be increasingly thinking about open-source as a potential security risk,” he says. “If the community itself doesn’t grapple with these issues meaningfully, legislators will come for them in a way that will fundamentally undermine what they’re doing.” He believes that the biggest hosting platforms, such as GitHub, need to employ more robust and timely moderation to ensure safe proliferation of these tools. Indeed, the biggest risk, argues Ajder, comes from accessibility. Elite criminals can probably generate their own malicious LLM systems without the support of the open-source community, but democratisation, despite its clear benefits, lowers the barrier to entry for criminality.

“History tells us that democratisation in the tech sector yields accessibility, innovation, and community,” says Coran Darling, a technology lawyer at DLA Piper. Preventing bad actors from misusing open-sourced LLMs, however, will require the government to take an interest in the implementation of at least some legislative guardrails. This, says Darling, “might be the happy medium” that ensures both that users are aware of their rights and that corporations can deploy and employ such models safely “without stifling all the positives democratisation can bring.”

For Gopalakrishnan, it’s important not to get carried away with the worst-case scenarios for open-sourcing LLMs. On balance, she argues, allowing developers to tinker with and hone their own versions of these powerful models is probably a net good for the field. “I think living in a world with a lot of competition in AI is important,” says Gopalakrishnan. After all, “competition brings out the best in everyone,” she says.