
Databricks open-sources its Dolly large language AI model

Large Language Models typically take hundreds of GPU hours to train and most require sending data to third-party companies.

By Ryan Morrison

In an attempt to open up its technology to a wider audience, enterprise software company Databricks has released Dolly, a large language model and its associated training code under an open-source licence. Despite being based on a much smaller underlying model, the company says it has ChatGPT-like functionality and can be run “in-house”.

Databricks says it was able to achieve similar chat-like functionality from an older, smaller language model. (Photo: rarrarorro/Shutterstock)

The move was inspired by the success of OpenAI’s natural language platform ChatGPT, which became one of the fastest-growing consumer apps within a couple of months of its release in November last year. It has since caused some of the world’s largest companies including Microsoft and Google to pivot and release generative and natural language AI tools.

“We show that anyone can take a dated off-the-shelf open source LLM and give it magical ChatGPT-like instruction-following ability by training it in 30 minutes on one machine, using high-quality training data,” Databricks wrote in a blog post explaining the decision.

It found that the type of instruction-following used in ChatGPT “does not seem to require the latest or largest models”, and claims that from just six billion parameters, compared to 175 billion in GPT-3 and many more in GPT-4 or Google’s PaLM, it was able to recreate the functionality of ChatGPT.

“We believe models like Dolly will help democratise LLMs, transforming them from something very few companies can afford into a commodity every company can own and customise to improve their products,” the company said.

Large language models: from LLaMA to Alpaca to Dolly

Developers such as OpenAI, Anthropic and AI21 Labs, as well as Microsoft, Google and IBM, charge end-users for access to their large language models through API calls. This can become expensive very quickly for organisations that need to make a high volume of calls on a regular basis. The alternative, training such a model from scratch, is itself an expensive endeavour that takes hundreds of GPU hours and training datasets containing trillions of words.
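The scale of those metered API bills is easy to sketch. The sketch below uses entirely hypothetical placeholder figures, not any vendor's actual pricing, simply to show how per-token charges compound with call volume:

```python
# Back-of-the-envelope estimate of monthly spend on a metered LLM API.
# All figures are hypothetical placeholders, not vendor quotes.

def monthly_api_cost(calls_per_day: int, tokens_per_call: int,
                     price_per_1k_tokens: float) -> float:
    """Estimated monthly cost: calls x tokens x per-1k-token price."""
    return calls_per_day * 30 * (tokens_per_call / 1000) * price_per_1k_tokens

# e.g. 10,000 calls a day at 1,500 tokens each, $0.002 per 1,000 tokens
print(round(monthly_api_cost(10_000, 1_500, 0.002), 2))  # 900.0
```

Even at these modest illustrative rates the cost scales linearly with usage, which is the economic argument Databricks is making for owning a small fine-tuned model outright.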

Then Meta released the weights for its high-quality language model, LLaMA, to researchers. LLaMA had been trained using more than 80,000 GPU hours; Stanford University researchers then built Alpaca on top of it, fine-tuning the model on a set of 50,000 human-like questions and answers. The result exhibited ChatGPT-like functionality despite the relatively small fine-tuning dataset.


Dolly, from Databricks, is able to deliver what the company describes as a “surprising degree of instruction-following capabilities” from a much smaller model. Where the Alpaca team demonstrated that a state-of-the-art model could be used as a chatbot engine, Databricks says even years-old models can exhibit the same types of behaviour if fine-tuned on a small corpus of instruction training data.

“Dolly works by taking an existing open-source six-billion-parameter model from EleutherAI and modifying it ever so slightly to elicit instruction following capabilities such as brainstorming and text generation not present in the original model, using data from Alpaca,” the company explained.
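The instruction data Dolly borrows from Alpaca consists of instruction/input/output records rendered into a fixed prompt template before fine-tuning. The sketch below mirrors the published Alpaca template; the exact formatting Databricks used for Dolly is an assumption, and the example record is invented for illustration:

```python
# Sketch: rendering Alpaca-style instruction records into training prompts.
# Mirrors the published Alpaca format (instruction / optional input / response);
# Dolly's exact template is an assumption here.

ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
)

def format_record(record: dict) -> str:
    """Turn one {'instruction', 'input', 'output'} record into a prompt string."""
    prompt = ALPACA_TEMPLATE.format(instruction=record["instruction"])
    if record.get("input"):  # the input field is optional in Alpaca data
        prompt += f"### Input:\n{record['input']}\n\n"
    prompt += f"### Response:\n{record['output']}"
    return prompt

example = {
    "instruction": "List three uses of a large language model.",
    "input": "",
    "output": "Brainstorming, summarisation and question answering.",
}
print(format_record(example))
```

Fine-tuning then amounts to continuing next-token training on a few tens of thousands of these rendered strings, which is why a single machine and a short run can be enough.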

The team were surprised it worked so well given the older and smaller nature of the underlying model compared to those provided by OpenAI or Google. “This suggests that much of the qualitative gains in state-of-the-art models like ChatGPT may owe to focused corpuses of instruction-following training data, rather than larger or better-tuned base models.”

“We’re calling the model Dolly — after Dolly the sheep, the first cloned mammal — because it’s an open-source clone of an Alpaca, inspired by a LLaMA. We’re in the earliest days of the democratisation of AI for the enterprise, and much work remains to be done, but we believe the technology underlying Dolly represents an exciting new opportunity for companies that want to cheaply build their own instruction-following models,” said Databricks in a blog post.

Using an open model rather than sending data to a centralised LLM makes sense for companies with highly sensitive and proprietary data. Handing that data over to a third party may be unpalatable, so organisations must weigh the trade-offs in model quality and cost against the security benefits of running models in-house.

Dolly will be available on Databricks, with the trained weights available to anyone wanting to experiment with the model. This is the first in a series of announcements from the company, which is switching its focus to helping organisations harness large language models. “We believe in the incredible power of artificial intelligence to transform the productivity of every organisation and individual, and welcome you to join us on this journey. Stay tuned for more in this area in the coming weeks.”

Read more: UK AI regulation white paper dodges ChatGPT questions
