
Databricks open-sources its Dolly large language AI model

Large Language Models typically take hundreds of GPU hours to train and most require sending data to third-party companies.

By Ryan Morrison

In an attempt to open up its technology to a wider audience, enterprise software company Databricks has released Dolly, a large language model and its associated training code under an open-source licence. Despite being based on a much smaller underlying model, the company says it has ChatGPT-like functionality and can be run “in-house”.

Databricks says it was able to achieve similar chat-like functionality from an older, smaller language model. (Photo: rarrarorro/Shutterstock)

The move was inspired by the success of OpenAI’s natural language platform ChatGPT, which became one of the fastest-growing consumer apps within a couple of months of its release in November last year. It has since caused some of the world’s largest companies including Microsoft and Google to pivot and release generative and natural language AI tools.

“We show that anyone can take a dated off-the-shelf open source LLM and give it magical ChatGPT-like instruction-following ability by training it in 30 minutes on one machine, using high-quality training data,” Databricks wrote in a blog post explaining the decision.

It found that the type of instruction-following used in ChatGPT “does not seem to require the latest or largest models”, and claims that from just six billion parameters, compared to 175 billion in GPT-3 and many more in GPT-4 or Google’s PaLM, it was able to recreate the functionality of ChatGPT.

“We believe models like Dolly will help democratise LLMs, transforming them from something very few companies can afford into a commodity every company can own and customise to improve their products,” the company said.

Large language models: from LLaMA to Alpaca to Dolly

Developers such as OpenAI, Anthropic and AI21 Labs, as well as Microsoft, Google and IBM, charge end-users for access to their large language models through API calls. This can become expensive very quickly for organisations that need to make a high volume of calls on a regular basis. The alternative, training such a model from scratch, is itself an expensive endeavour that takes hundreds of GPU hours and training datasets containing trillions of words.
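The scale of those metered API bills is easy to sketch. The sketch below uses entirely hypothetical placeholder figures, not any vendor's actual pricing, simply to show how per-token charges compound with call volume:

```python
# Back-of-the-envelope estimate of monthly spend on a metered LLM API.
# All figures are hypothetical placeholders, not vendor quotes.

def monthly_api_cost(calls_per_day: int, tokens_per_call: int,
                     price_per_1k_tokens: float) -> float:
    """Estimated monthly cost: calls x tokens x per-1k-token price."""
    return calls_per_day * 30 * (tokens_per_call / 1000) * price_per_1k_tokens

# e.g. 10,000 calls a day at 1,500 tokens each, $0.002 per 1,000 tokens
print(round(monthly_api_cost(10_000, 1_500, 0.002), 2))  # 900.0
```

Even at these modest illustrative rates the cost scales linearly with usage, which is the economic argument Databricks is making for owning a small fine-tuned model outright.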

Then Meta released the weights for its high-quality language model, LLaMA, to researchers. LLaMA had been trained using more than 80,000 GPU hours; Stanford University researchers then built Alpaca on top of it, fine-tuning the model on a set of 50,000 human-like questions and answers. The result exhibited ChatGPT-like functionality despite the relatively small fine-tuning dataset.


Dolly, from Databricks, is able to deliver what the company describes as a “surprising degree of instruction-following capabilities” from a much smaller model. Where the Alpaca team demonstrated that a state-of-the-art model could be used as a chatbot engine, Databricks says even years-old models can exhibit the same types of behaviour if fine-tuned on a small corpus of instruction training data.

“Dolly works by taking an existing open-source six-billion-parameter model from EleutherAI and modifying it ever so slightly to elicit instruction following capabilities such as brainstorming and text generation not present in the original model, using data from Alpaca,” the company explained.
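The instruction data Dolly borrows from Alpaca consists of instruction/input/output records rendered into a fixed prompt template before fine-tuning. The sketch below mirrors the published Alpaca template; the exact formatting Databricks used for Dolly is an assumption, and the example record is invented for illustration:

```python
# Sketch: rendering Alpaca-style instruction records into training prompts.
# Mirrors the published Alpaca format (instruction / optional input / response);
# Dolly's exact template is an assumption here.

ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
)

def format_record(record: dict) -> str:
    """Turn one {'instruction', 'input', 'output'} record into a prompt string."""
    prompt = ALPACA_TEMPLATE.format(instruction=record["instruction"])
    if record.get("input"):  # the input field is optional in Alpaca data
        prompt += f"### Input:\n{record['input']}\n\n"
    prompt += f"### Response:\n{record['output']}"
    return prompt

example = {
    "instruction": "List three uses of a large language model.",
    "input": "",
    "output": "Brainstorming, summarisation and question answering.",
}
print(format_record(example))
```

Fine-tuning then amounts to continuing next-token training on a few tens of thousands of these rendered strings, which is why a single machine and a short run can be enough.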

The team were surprised it worked so well given the older and smaller nature of the underlying model compared to those provided by OpenAI or Google. “This suggests that much of the qualitative gains in state-of-the-art models like ChatGPT may owe to focused corpuses of instruction-following training data, rather than larger or better-tuned base models.”

“We’re calling the model Dolly — after Dolly the sheep, the first cloned mammal — because it’s an open-source clone of an Alpaca, inspired by a LLaMA. We’re in the earliest days of the democratisation of AI for the enterprise, and much work remains to be done, but we believe the technology underlying Dolly represents an exciting new opportunity for companies that want to cheaply build their own instruction-following models,” said Databricks in a blog post.

Using an open model rather than sending data to a centralised LLM makes sense for companies with highly sensitive and proprietary data. Handing that data over to a third party may be unpalatable, so organisations must weigh the trade-offs in model quality and cost against the security benefits of running models in-house.

Dolly will be available on Databricks, with the trained weights available to anyone wanting to experiment with the model. This is the first in a series of announcements from the company, which is switching its focus to helping organisations harness large language models. “We believe in the incredible power of artificial intelligence to transform the productivity of every organisation and individual, and welcome you to join us on this journey. Stay tuned for more in this area in the coming weeks.”

Read more: UK AI regulation white paper dodges ChatGPT questions
