Facebook has created and published a new AI training recipe dubbed “RoBERTa”, which has shot to the top of the General Language Understanding Evaluation (GLUE) benchmark leaderboard, the company said today.
GLUE is a collection of tools for evaluating the performance of models across a diverse set of existing Natural Language Processing (NLP) tasks. It is designed to help researchers develop ways for their AI systems to process language in a way that is not exclusive to a single task, genre, or dataset.
RoBERTa is an optimised version of Google’s popular BERT system for pre-training NLP models, which Google open sourced in November last year. Like BERT, it relies on unannotated text drawn from the web, as opposed to a language corpus that has been labeled specifically for a given task.
Facebook #AI’s RoBERTa is a new training recipe that improves on BERT, @GoogleAI’s self-supervised method for pretraining #NLP systems. By training longer, on more data, and dropping BERT’s next-sentence prediction, RoBERTa topped the GLUE leaderboard. https://t.co/BymzsSO5JD
— Facebook AI (@facebookai) July 29, 2019
As Facebook AI put it in a new paper published on Tuesday: “RoBERTa builds on BERT’s language masking strategy, wherein the system learns to predict intentionally hidden sections of text within otherwise unannotated language examples.
“RoBERTa, which was implemented in PyTorch, modifies key hyperparameters in BERT, including removing BERT’s next-sentence pretraining objective, and training with much larger mini-batches and learning rates. This allows RoBERTa to improve on the masked language modeling objective compared with BERT and leads to better downstream task performance. We also explore training RoBERTa on an order of magnitude more data than BERT, for a longer amount of time. We used existing unannotated NLP data sets as well as CC-News, a novel set drawn from public news articles.”
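For readers who want a concrete picture of that masking strategy, the following is a minimal PyTorch sketch of BERT-style masked language modelling. The toy vocabulary, model size, mask token id and 15 per cent masking rate are illustrative assumptions, not Facebook’s released implementation.

```python
# Minimal sketch of masked language modelling: hide a random subset of
# tokens and train a small Transformer to predict what was hidden.
# All sizes and ids below are toy assumptions for illustration.
import torch
import torch.nn as nn

VOCAB_SIZE = 1000   # toy vocabulary (assumption)
MASK_ID = 0         # id reserved for the <mask> token (assumption)
MASK_PROB = 0.15    # fraction of tokens hidden, as in BERT-style pretraining

class ToyMaskedLM(nn.Module):
    """A tiny Transformer encoder that predicts the identity of masked tokens."""
    def __init__(self, vocab_size=VOCAB_SIZE, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, token_ids):
        hidden = self.encoder(self.embed(token_ids))
        return self.lm_head(hidden)  # a score per vocabulary entry, per position

def mask_tokens(token_ids, mask_prob=MASK_PROB):
    """Hide a random subset of tokens; the model must reconstruct them."""
    labels = token_ids.clone()
    mask = torch.rand(token_ids.shape) < mask_prob
    labels[~mask] = -100                  # ignore unmasked positions in the loss
    corrupted = token_ids.clone()
    corrupted[mask] = MASK_ID             # replace chosen tokens with <mask>
    return corrupted, labels

# One illustrative training step on random "text" (real pretraining uses
# billions of words of unannotated web and news text).
model = ToyMaskedLM()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
tokens = torch.randint(1, VOCAB_SIZE, (8, 32))   # batch of 8 sequences of 32 tokens
inputs, labels = mask_tokens(tokens)
logits = model(inputs)
loss = nn.functional.cross_entropy(logits.view(-1, VOCAB_SIZE), labels.view(-1))
loss.backward()
optimizer.step()
```

RoBERTa keeps this masked-prediction objective; the changes described in the paper concern how the training is run (more data, bigger batches, longer schedules) rather than the objective itself, apart from dropping next-sentence prediction.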
The results showed that tuning BERT’s training procedure can significantly improve its performance on a variety of NLP tasks, the team said.
Its release is part of Facebook’s “ongoing commitment to advancing the state-of-the-art in self-supervised systems that can be developed with less reliance on time- and resource-intensive data labeling”. It comes days after Facebook also shared a dataset dubbed “WikiMatrix”, which includes 135 million parallel sentences covering 1,620 language pairs across 85 languages. The dataset was extracted from Wikipedia with the aim of directly training neural machine translation systems between distantly related languages, without first translating to English.
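To give a sense of how such parallel data might be consumed, here is a small, hedged sketch that reads sentence pairs from a tab-separated file. The file name, the three-column layout (alignment score, source sentence, target sentence) and the score threshold are assumptions for illustration and should be checked against the released data.

```python
# Hedged sketch of reading parallel sentence pairs from a gzipped TSV file.
# The file name, column layout and threshold are assumptions, not the
# documented WikiMatrix format; verify against the released data.
import gzip
import csv

def read_parallel_sentences(path, min_score=1.04):
    """Yield (source, target) pairs whose alignment score passes a threshold."""
    with gzip.open(path, 'rt', encoding='utf-8') as handle:
        for row in csv.reader(handle, delimiter='\t'):
            if len(row) != 3:
                continue  # skip malformed lines
            score, src, tgt = row
            if float(score) >= min_score:
                yield src, tgt

# Hypothetical example: a German-Japanese file, i.e. a distantly related
# language pair aligned without going through English.
for src, tgt in read_parallel_sentences('WikiMatrix.de-ja.tsv.gz'):
    print(src, '|||', tgt)
    break
```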
The RoBERTa model, along with pretraining and fine-tuning code implemented in PyTorch, is available here. The WikiMatrix dataset and examples are here.
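For those who want to try it, the fairseq repository documents a torch.hub entry point for the released PyTorch implementation; the short example below loads a pretrained model and extracts sentence features. The model name and calls follow that documentation, but the example sentence is ours, and downloading the weights requires the fairseq dependencies and an internet connection.

```python
# Illustrative use of the released PyTorch implementation via torch.hub,
# following the fairseq repository's documented interface.
import torch

roberta = torch.hub.load('pytorch/fairseq', 'roberta.base')
roberta.eval()  # disable dropout for inference

# Convert raw text into the model's byte-pair-encoded token ids ...
tokens = roberta.encode('RoBERTa tops the GLUE leaderboard.')

# ... and extract contextual features from the final Transformer layer.
features = roberta.extract_features(tokens)
print(features.shape)  # (1, number_of_tokens, hidden_size)
```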