Google researchers say they have developed a new technique for training AI on data-scarce languages, one that could transform automatic speech recognition (ASR) even for rare or unusual languages and allow it to be used in low-latency applications such as Google's own voice assistant.
Training an ASR system is time- and data-intensive: it needs a huge dataset containing vast amounts of audio and text to learn to interpret a language to a high standard. The world has more than 6,000 languages, and many of them include vastly different dialects.
Real-time translation or transcription of a conversation with someone speaking a less common language therefore becomes a huge problem. (UNESCO says that at least 43 percent of the roughly 6,000 languages spoken in the world are endangered. This figure does not include data-deficient languages, for which no reliable information is available.)
A New Automatic Speech Recognition Training Technique
To potentially solve this issue, Google says it has developed a technique that builds on existing research by using “multilingual speech recognition”, in which a single model is trained to recognize multiple languages.
As its team of researchers puts it: “This is based on the principle that for training, much of the ‘knowledge’ a neural network learns from audio data of a data-rich language is re-usable by data-scarce languages; we don’t need to learn everything from scratch.”
This led the team to put forward an end-to-end (E2E) system that trains a single model and enables real-time multilingual speech recognition. The technique also simplifies training and serving the recognizer.
Testing with Lexical Overlap
To train this model, Google focused on nine Indian languages, some of which have overlapping lexical and acoustic content.
Google combined 37,000 hours of data from Hindi, Marathi, Urdu, Bengali, Tamil, Telugu, Kannada, Malayalam and Gujarati.
In developing the technique the researchers ran into the issue of bias from languages over-represented in the training set. They note: “This is more prominent in an E2E model, which unlike a traditional ASR system, does not have access to additional in-language text data and learns lexical characteristics of the languages solely from the audio training data.”
With some clever architectural modifications, described below, they were able to address this imbalance.
The resulting E2E multilingual model addresses two core issues of ASR and its integration into user devices. First, it supports real-time streaming recognition, emitting words as if someone were typing. Second, it handles the imbalance that arises when some languages have far less training data than others.
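To make the streaming behaviour concrete, here is a minimal sketch of a recognizer that consumes audio in 100 ms chunks and prints a growing partial transcript after each one. The FakeStreamingRecognizer class and its canned output are hypothetical placeholders, not Google's model or API.

```python
# Minimal sketch of what "streaming" means for an E2E recognizer: audio is
# consumed in small chunks and a partial transcript is emitted after each one,
# instead of waiting for the whole utterance. The recognizer below is a
# placeholder, not Google's model.

import numpy as np

SAMPLE_RATE = 16000
CHUNK = SAMPLE_RATE // 10  # 100 ms of audio per chunk


class FakeStreamingRecognizer:
    """Stand-in for a streaming model such as an RNN-T decoder."""

    def __init__(self):
        self._canned = ["नमस्ते", "आप", "कैसे", "हैं"]  # pretend decoded words
        self._emitted = []

    def accept_chunk(self, samples: np.ndarray) -> str:
        # A real model would update its encoder/decoder state with the new
        # frames; here we just extend the hypothesis with a canned word.
        if self._canned:
            self._emitted.append(self._canned.pop(0))
        return " ".join(self._emitted)


audio = np.zeros(4 * CHUNK, dtype=np.float32)  # pretend 400 ms of speech
recognizer = FakeStreamingRecognizer()
for start in range(0, len(audio), CHUNK):
    partial = recognizer.accept_chunk(audio[start:start + CHUNK])
    print(partial)  # transcript grows word by word, as if someone were typing
```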
In a paper presented at Interspeech 2019, the researchers write: “Using nine Indian languages, we showed that our best system, built with RNN-T model and adapter modules, significantly outperforms both the monolingual RNN-T models, and the state-of-the-art monolingual conventional recognizers.”
Multilingual Speech Recognition End-to-End Model
How does it work?
In the paper describing their approach to the E2E model, the researchers explain that they first created a single global model trained on data from all of the languages.
“We [then] provided an extra language identifier input, which is an external signal derived from the language locale of the training data; i.e. the language preference set in an individual’s phone. This signal is combined with the audio input as a one-hot feature vector.
“We hypothesize that the model is able to use the language vector not only to disambiguate the language but also to learn separate features for separate languages, as needed, which helped with data imbalance.”
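As a rough sketch of what this might look like in practice, the snippet below one-hot encodes a language locale over the nine training languages and appends it to every frame of audio features. The per-frame concatenation, the 80-dimensional features, and the helper names are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

# The nine Indian languages in the training mix; the order here is arbitrary.
LANGUAGES = ["hi", "mr", "ur", "bn", "ta", "te", "kn", "ml", "gu"]


def one_hot_language(locale: str) -> np.ndarray:
    """Turn the device's language locale into a one-hot vector."""
    vec = np.zeros(len(LANGUAGES), dtype=np.float32)
    vec[LANGUAGES.index(locale)] = 1.0
    return vec


def augment_frames(audio_features: np.ndarray, locale: str) -> np.ndarray:
    """Append the one-hot language vector to every audio feature frame.

    audio_features: array of shape (num_frames, feature_dim), e.g. log-mel
    filterbank features. The language signal is concatenated to each frame,
    so the network sees it at every timestep (an assumption about how the
    combination is done).
    """
    lang = one_hot_language(locale)
    tiled = np.tile(lang, (audio_features.shape[0], 1))
    return np.concatenate([audio_features, tiled], axis=1)


# Example: 200 frames of 80-dim features tagged as Hindi ("hi").
frames = augment_frames(np.random.randn(200, 80).astype(np.float32), "hi")
print(frames.shape)  # (200, 89)
```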
In a second stage, they freeze all of the model's parameters and introduce adapter modules. These modules hold separate parameters for each language and act as a per-language fine-tuner for the system.
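Adapter modules of this kind are often implemented as small per-language residual bottleneck layers added on top of frozen shared layers. The sketch below follows that common pattern; the residual form, dimensions, and names are assumptions for illustration rather than details from the paper.

```python
import numpy as np


class LanguageAdapter:
    """Small per-language residual bottleneck (down-project, ReLU, up-project)."""

    def __init__(self, model_dim: int, bottleneck_dim: int, rng: np.random.Generator):
        self.w_down = rng.normal(scale=0.02, size=(model_dim, bottleneck_dim))
        self.w_up = rng.normal(scale=0.02, size=(bottleneck_dim, model_dim))

    def __call__(self, hidden: np.ndarray) -> np.ndarray:
        # Residual connection: the adapter only nudges the frozen model's output.
        return hidden + np.maximum(hidden @ self.w_down, 0.0) @ self.w_up


# One adapter per language; the shared (global) model's weights stay frozen
# while only these small modules are trained on each language's data.
rng = np.random.default_rng(0)
adapters = {lang: LanguageAdapter(model_dim=512, bottleneck_dim=64, rng=rng)
            for lang in ["hi", "mr", "ur", "bn", "ta", "te", "kn", "ml", "gu"]}

shared_layer_output = rng.normal(size=(200, 512))  # stand-in for frozen encoder output
adapted = adapters["kn"](shared_layer_output)      # apply the Kannada adapter
print(adapted.shape)  # (200, 512)
```

Because only the small adapter matrices are trained for each language, every language gets some dedicated capacity without duplicating the whole model.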
Arindrima Datta and Anjuli Kannan, software engineers at Google, wrote in a blog post: “Putting all of these elements together, our multilingual model outperforms all the single-language recognizers, with especially large improvements in data-scarce languages like Kannada and Urdu.”
The extent to which such a model would work on less closely related languages is not yet clear. The team says it is keen to continue developing multilingual ASR systems for other language groups.