Anyone who is concerned about computers taking over should look away now, because they are a step closer to sounding just like humans.
Researchers in the UK at Google’s DeepMind unit have been working on making computer-generated speech sound as “natural” as humans.
The technology, called WaveNet, focuses on speech synthesis, or text-to-speech (TTS), and was found to sound more natural than any of Google’s existing text-to-speech products.
However, this was only achieved after the WaveNet artificial neural network was trained to produce English and Chinese speech, which required copious amounts of computing power, so the technology probably won’t be hitting the mainstream any time soon.
WaveNet uses a convolutional neural network, a type of model widely used in deep learning. It is trained on data and can then make inferences about new data, as well as generate new data of its own.
Training WaveNet used Google’s North American English and Mandarin TTS data from professional female speakers. Once trained, it was put into competition against a concatenative system driven by a hidden Markov model and a parametric system based on a long short-term memory recurrent neural network, the research said.
WaveNet was found to have performed “significantly better” than the other systems, but it has not yet matched actual human recordings.
The underlying problem is the sheer volume of data involved: the model has to produce at least 16,000 waveform samples for every second of audio, which currently makes mass production unlikely.
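To put that sampling rate in perspective, here is a rough back-of-the-envelope sketch in Python. Only the 16,000 samples a second comes from the research as reported here; the clip lengths and the function name are illustrative assumptions.

```python
SAMPLE_RATE_HZ = 16_000  # samples per second, the figure quoted in the article


def samples_needed(duration_seconds: float, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Return how many waveform samples a clip of the given length contains."""
    return round(duration_seconds * sample_rate_hz)


if __name__ == "__main__":
    # Hypothetical clip lengths, chosen only for illustration.
    for seconds in (1, 10, 60):
        print(f"{seconds:>3} s of audio -> {samples_needed(seconds):,} samples")
```

Even a one-minute clip works out to close to a million individual samples, each of which has to be generated by the model.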
Human speech isn’t the only thing that the researchers have been testing the technology on. It was also trained on solo piano music from YouTube to produce new music.
Examples of the music and of the speech tests can be found here. The speech results do sound suspiciously human and the music it created is certainly worth a listen.
The researchers wrote: “WaveNets open up a lot of possibilities for TTS, music generation and audio modelling in general. The fact that directly generating timestep per timestep with deep neural networks works at all for 16kHz audio is really surprising, let alone that it outperforms state-of-the-art TTS systems. We are excited to see what we can do with them next.”
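The “timestep per timestep” generation the researchers describe can be pictured with a toy loop in Python, in which each new sample is computed from the samples that came before it. This is only a sketch of the general idea: the prediction rule below is a hypothetical stand-in (a fixed two-tap formula plus a little noise), not WaveNet’s neural network, and every name and constant in it is an assumption made for illustration.

```python
import random

SAMPLE_RATE_HZ = 16_000  # 16kHz audio, as in the researchers' quote


def predict_next(prev2: float, prev1: float, rng: random.Random) -> float:
    """Hypothetical stand-in for the trained network.

    WaveNet would use a deep neural network here; this toy rule just
    combines the two most recent samples and adds a little noise.
    """
    value = 1.8 * prev1 - 0.9 * prev2 + rng.gauss(0.0, 0.01)
    return max(-1.0, min(1.0, value))  # keep the sample in the range [-1, 1]


def generate(n_samples: int, seed: int = 0) -> list[float]:
    """Generate audio one sample at a time, each depending on earlier ones."""
    rng = random.Random(seed)
    samples = [0.0, 0.1]  # arbitrary starting values
    while len(samples) < n_samples:
        samples.append(predict_next(samples[-2], samples[-1], rng))
    return samples[:n_samples]


if __name__ == "__main__":
    one_second = generate(SAMPLE_RATE_HZ)
    print(f"Generated {len(one_second):,} samples in {len(one_second) - 2:,} sequential steps")
```

Because every sample depends on the ones before it, the loop cannot skip ahead, which is one way to see why producing audio at this rate is so demanding.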
The paper can be found here.