Sign up for our newsletter - Navigating the horizon of business technology​
Technology / Emerging Technology

Bigger than BERT? Microsoft’s New Language Model Sets Performance Records

Microsoft has unveiled the world’s largest deep learning language model to-date: a 17 billion-parameter “Turing Natural Language Generation (T-NLG)” model that the company believes will pave the way for more fluent chatbots and digital assistants.

The T-NLG “outperforms the state of the art” on a several benchmarks, including summarisation and question answering, Microsoft claimed in a new research blog, as the company stakes its claim to a potentially dominant position in one of the most closely watched new technologies, natural language processing.

Deep learning language models like BERT, developed by Google, have hugely improved the powers of natural language processing, by training on colossal data sets with billions of parameters to learn the contextual relations between words.

See also: Meet BERT: The NLP Technique That Knows Paris from Paris Hilton

Bigger is not always better, those working on language models may recognise, but Microsoft scientist Corby Rosset said his team “have observed that the bigger the model and the more diverse and comprehensive the pretraining data, the better it performs at generalizing to multiple downstream tasks even with fewer training examples.”

White papers from our partners

He emphasised: “Therefore, we believe it is more efficient to train a large centralized multi-task model and share its capabilities across numerous tasks.”

A Microsoft illustration shows the scale of the model.

Like BERT, Microsoft’s T-NLG is a Transformer-based generative language model: i.e. it can generate words to complete open-ended textual tasks, as well as being able to generate direct answers to questions and summaries of input documents. (Your smartphone’s assistant autonomously booking you a haircut was just the start…)

It is also capable of answering “zero shot” questions, or those without a context passage, outperforming “rival” LSTM models similar to CopyNet.

Rosset noted: “A larger pretrained model requires fewer instances of downstream tasks to learn them well.

“We only had, at most, 100,000 examples of “direct” answer question-passage-answer triples, and even after only a few thousand instances of training, we had a model that outperformed the LSTM baseline that was trained on multiple epochs of the same data. This observation has real business impact, since it is expensive to collect annotated supervised data.”

The New Deep Learning Language Model Tapped NVIDIA DGX-2

As no model with over 1.3 billion parameters can run on a single GPU, the model itself must be parallelised, or broken into pieces, across multiple GPUs, Microsoft said, adding that it took advantage of several hardware and software breakthroughs.

1: We leverage a NVIDIA DGX-2 hardware setup, with InfiniBand connections so that communication between GPUs is faster than previously achieved.

2: We apply tensor slicing to shard the model across four NVIDIA V100 GPUs on the NVIDIA Megatron-LM framework.

3: DeepSpeed with ZeRO allowed us to reduce the model-parallelism degree (from 16 to 4), increase batch size per node by fourfold, and reduce training time by three times. DeepSpeed makes training very large models more efficient with fewer GPUs, and it trains at batch size of 512 with only 256 NVIDIA GPUs compared to 1024 NVIDIA GPUs needed by using Megatron-LM alone. DeepSpeed is compatible with PyTorch.”

See also: Microsoft Invests $1 Billion in OpenAI: Eyes “Unprecedented Scale” Computing Platform

A language model attempts to learn the structure of natural language through hierarchical representations, using both low-level features (word representations) and high-level features (semantic meaning). Such models are typically trained on large datasets in an unsupervised manner, with the model using deep neural networks to “learn” the syntactic features of language beyond simple word embeddings.

Yet as AI specialist Peltarion’s head of research Anders Arpteg put it Computer Business Review recently:NLP generally has a long way to go before it’s on par with humans at understanding nuances in text. For instance, if you say, ‘a trophy could not be stored in the suitcase because it was too small’, humans are much better at understanding whether it’s the trophy or the suitcase that’s too small.”

He added: “In addition, the complex coding… can mean many developers and domain experts aren’t equipped to deal with it, and, despite being open-sourced, it’s hard for many companies to make use of it. BERT was ultimately built by Google, for the likes of Google, and with tech giants having not only access to superior skills, but resources and money, BERT remains inaccessible for the majority of companies.”

The T-NLG was created by a larger research group, Project Turing, which is working to add deep learning tools to text and image processing, with its work being integrated into products including Bing, Office, and Xbox.

Microsoft is releasing a private demo of T-NLG, including its freeform generation, question answering, and summarisation capabilities, to a “small set of users” within the academic community for initial testing and feedback, as it refines the model.


This article is from the CBROnline archive: some formatting and images may not be present.

CBR Staff Writer

CBR Online legacy content.