Microsoft Unveils Record-Breaking New Deep Learning Language Model

Microsoft has unveiled the world’s largest deep learning language model to date: a 17-billion-parameter “Turing Natural Language Generation (T-NLG)” model that the company believes will pave the way for more fluent chatbots and digital assistants.

The T-NLG “outperforms the state of the art” on several benchmarks, including summarisation and question answering, Microsoft claimed in a new research blog post, as the company stakes its claim to a potentially dominant position in natural language processing, one of the most closely watched new technologies.

Deep learning language models like BERT, developed by Google, have hugely improved the powers of natural language processing by training models with up to billions of parameters on colossal data sets to learn the contextual relations between words.

See also: Meet BERT: The NLP Technique That Knows Paris from Paris Hilton

Bigger is not always better, those working on language models may recognise, but Microsoft scientist Corby Rosset said his team “have observed that the bigger the model and the more diverse and comprehensive the pretraining data, the better it performs at generalizing to multiple downstream tasks even with fewer training examples.”

He emphasised: “Therefore, we believe it is more efficient to train a large centralized multi-task model and share its capabilities across numerous tasks.”

A Microsoft illustration shows the scale of the model.

Like BERT, Microsoft’s T-NLG is built on the Transformer architecture. It is a generative language model: it can generate words to complete open-ended textual tasks, as well as generating direct answers to questions and summaries of input documents. (Your smartphone’s assistant autonomously booking you a haircut was just the start…)

It is also capable of “zero shot” question answering, that is, answering questions without a context passage, and it outperforms “rival” LSTM models similar to CopyNet.

Rosset noted: “A larger pretrained model requires fewer instances of downstream tasks to learn them well.

“We only had, at most, 100,000 examples of “direct” answer question-passage-answer triples, and even after only a few thousand instances of training, we had a model that outperformed the LSTM baseline that was trained on multiple epochs of the same data. This observation has real business impact, since it is expensive to collect annotated supervised data.”

The New Deep Learning Language Model Tapped NVIDIA DGX-2

As no model with more than 1.3 billion parameters can fit on a single GPU, the model itself must be parallelised, or broken into pieces, across multiple GPUs, Microsoft said, adding that it took advantage of several hardware and software breakthroughs.

“1: We leverage a NVIDIA DGX-2 hardware setup, with InfiniBand connections so that communication between GPUs is faster than previously achieved.

“2: We apply tensor slicing to shard the model across four NVIDIA V100 GPUs on the NVIDIA Megatron-LM framework.

“3: DeepSpeed with ZeRO allowed us to reduce the model-parallelism degree (from 16 to 4), increase batch size per node by fourfold, and reduce training time by three times. DeepSpeed makes training very large models more efficient with fewer GPUs, and it trains at batch size of 512 with only 256 NVIDIA GPUs compared to 1024 NVIDIA GPUs needed by using Megatron-LM alone. DeepSpeed is compatible with PyTorch.”
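The GPU counts in the third point can be checked with some quick arithmetic. The sketch below is one reading of the quoted figures, not Microsoft's own accounting. It assumes that each DGX-2 node holds 16 GPUs, that every copy of the model is sharded across as many GPUs as the model-parallelism degree, and that the global batch of 512 is split evenly across those copies. The function name `layout` and its parameters are hypothetical, chosen for illustration only.

```python
# Figures quoted in the article.
GLOBAL_BATCH = 512

# Assumption (not stated in the article): one NVIDIA DGX-2 node holds 16 GPUs.
GPUS_PER_NODE = 16


def layout(total_gpus, model_parallel_degree,
           global_batch=GLOBAL_BATCH, gpus_per_node=GPUS_PER_NODE):
    """Estimate how a training job divides into model replicas.

    Assumes each replica occupies `model_parallel_degree` GPUs and that
    the global batch is split evenly across replicas (data parallelism).
    """
    # Number of complete copies of the model across the whole cluster.
    replicas = total_gpus // model_parallel_degree
    # Number of those copies that fit inside a single node.
    replicas_per_node = gpus_per_node // model_parallel_degree
    # Share of the global batch handled by each copy.
    batch_per_replica = global_batch // replicas
    return {
        "replicas": replicas,
        "replicas_per_node": replicas_per_node,
        "batch_per_replica": batch_per_replica,
        "batch_per_node": batch_per_replica * replicas_per_node,
    }


# Megatron-LM alone: 1024 GPUs, model-parallelism degree 16.
megatron_only = layout(total_gpus=1024, model_parallel_degree=16)

# Megatron-LM with DeepSpeed and ZeRO: 256 GPUs, model-parallelism degree 4.
with_deepspeed = layout(total_gpus=256, model_parallel_degree=4)

if __name__ == "__main__":
    print("Megatron-LM alone:", megatron_only)
    print("With DeepSpeed:   ", with_deepspeed)
```

Under these assumptions both setups run 64 copies of the model with 8 samples each, but the DeepSpeed setup packs four copies into every node rather than one. That is consistent with the quoted fourfold increase in batch size per node and with the drop from 1,024 to 256 GPUs.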
