
This Single AI Model Was Used to Translate Multiple Languages

"Much of the 'knowledge' a neural network learns from audio data of a data-rich language is re-usable by data-scarce languages."

By CBR Staff Writer

Google researchers say they have developed a new technique for training AI on data-scarce languages, which could transform automatic speech recognition (ASR), even for rare or unusual languages, and allow its use in low-latency applications like Google's own voice assistant.

Training an ASR system is time- and data-intensive: it needs huge datasets of audio and text to learn to interpret a language to a high quality. The world has more than 6,000 languages, and many of those include vastly different dialects.

Real-time translation or transcription of a conversation with someone speaking a less common language therefore becomes a huge problem. (UNESCO says that at least 43 percent of the world's 6,000-plus languages are endangered, a figure that excludes data-deficient languages, for which no reliable information is available.)

[Image credit: UNESCO]

A New Automatic Speech Recognition Training Technique

To potentially solve this issue Google says it has developed a technique that builds on existing research by using “multilingual speech recognition”, in which a single model is used to translate multiple languages.

As its team of researchers puts it: “This is based on the principle that for training, much of the ‘knowledge’ a neural network learns from audio data of a data-rich language is re-usable by data-scarce languages; we don’t need to learn everything from scratch.”

This led the team to put forward an end-to-end (E2E) system that trains a single model and enables real-time multilingual speech recognition. The technique also simplifies training and serving AI translators.

Testing with Lexical Overlap

To train this model Google focused on nine Indian languages, some of which have overlapping lexical and acoustic content.


Google combined 37,000 hours of data from Hindi, Marathi, Urdu, Bengali, Tamil, Telugu, Kannada, Malayalam and Gujarati.

In developing the technique the researchers ran into the issue of bias from languages over-represented in the training set. They note: “This is more prominent in an E2E model, which unlike a traditional ASR system, does not have access to additional in-language text data and learns lexical characteristics of the languages solely from the audio training data.”

With some clever architectural modifications, they made good progress.

The resulting E2E multilingual model solves two core issues of ASR and its integration into user devices. First, it allows for real-time streaming ASR that can output words as if someone were typing them. Second, it addresses the data imbalance caused by the scarcity of training data for some languages.

In a paper presented at Interspeech 2019, the developers commented: “Using nine Indian languages, we showed that our best system, built with RNN-T model and adapter modules, significantly outperforms both the monolingual RNN-T models, and the state-of-the-art monolingual conventional recognizers.”

[Image credit: Google]

Multilingual Speech Recognition End-to-End Model

How does it work?

The researchers wrote in a paper on their approach that to build the E2E model they first created a global model trained on data from all the languages.

“We [then] provided an extra language identifier input, which is an external signal derived from the language locale of the training data; i.e. the language preference set in an individual’s phone. This signal is combined with the audio input as a one-hot feature vector.

“We hypothesize that the model is able to use the language vector not only to disambiguate the language but also to learn separate features for separate languages, as needed, which helped with data imbalance.”
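To make the quoted idea concrete, here is a minimal sketch of appending a one-hot language identifier to an audio feature frame. The language codes, the 80-dimensional frame, and the function names are illustrative assumptions, not details from the paper.

```python
import numpy as np

# The nine Indian languages in the training mix (codes are assumptions).
LANGUAGES = ["hi", "mr", "ur", "bn", "ta", "te", "kn", "ml", "gu"]

def one_hot(language: str) -> np.ndarray:
    """Encode the language locale (e.g. the phone's language preference)
    as a one-hot vector over the supported languages."""
    vec = np.zeros(len(LANGUAGES))
    vec[LANGUAGES.index(language)] = 1.0
    return vec

def augment_frame(audio_frame: np.ndarray, language: str) -> np.ndarray:
    """Combine the audio input with the language signal by concatenation."""
    return np.concatenate([audio_frame, one_hot(language)])

frame = np.random.randn(80)      # e.g. one 80-dim log-mel feature frame
x = augment_frame(frame, "kn")   # Kannada-tagged input
print(x.shape)                   # (89,) = 80 audio dims + 9 language dims
```

The model can then condition on the trailing nine dimensions to disambiguate the language of the incoming audio.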

In a second stage they freeze all model parameters and introduce adapter modules. These modules hold separate parameters for each language and act as fine-tuners for the system.
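A per-language adapter of this kind is often built as a small residual bottleneck attached to the frozen shared layers. The sketch below assumes that form, along with illustrative dimensions and near-zero initialisation; none of these specifics are taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

class Adapter:
    """A small per-language bottleneck trained while the shared model
    stays frozen. Initialised near zero so, via the residual path,
    the adapter starts out close to the identity."""
    def __init__(self, dim: int, bottleneck: int):
        self.down = rng.normal(scale=0.01, size=(dim, bottleneck))
        self.up = rng.normal(scale=0.01, size=(bottleneck, dim))

    def __call__(self, h: np.ndarray) -> np.ndarray:
        # Residual connection: frozen-layer output plus a small
        # language-specific correction (down-project, ReLU, up-project).
        return h + np.maximum(h @ self.down, 0.0) @ self.up

# One adapter per language; the frozen shared layers serve all of them.
adapters = {lang: Adapter(dim=256, bottleneck=16)
            for lang in ["kn", "ur", "hi"]}

h = rng.normal(size=(256,))   # output of a frozen shared layer
out = adapters["ur"](h)       # language-specific fine-tuning path for Urdu
print(out.shape)              # (256,)
```

Because only the adapter weights are trained, each additional language adds a small number of parameters rather than a whole new model.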

[Image credit: Google]

Arindrima Datta and Anjuli Kannan, software engineers at Google, wrote in a blog post: “Putting all of these elements together, our multilingual model outperforms all the single-language recognizers, with especially large improvements in data-scarce languages like Kannada and Urdu.”

The extent to which such a model would work on less closely related languages is not yet clear. The team say they are keen to continue developing multilingual ASR models for other language groups.

