Researchers from Google’s DeepMind and the University of Oxford have collaborated to create highly accurate lip-reading software using artificial intelligence.
The AI system was trained on almost 5,000 hours of TV footage from the BBC, comprising a total of 118,000 sentences.
The key contributions detailed in the report included a ‘Watch, Listen, Attend and Spell’ (WLAS) network structure, which learns to transcribe videos of mouth motion to characters.
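To give a rough sense of how such a network operates, the sketch below shows a simplified attention-based encoder-decoder that maps a sequence of per-frame visual features to characters. It is purely illustrative and is not DeepMind’s implementation: the module names, dimensions and vocabulary size are hypothetical, and the audio (“Listen”) branch of the full WLAS model is omitted.

```python
# Illustrative sketch (not DeepMind's code): an attention-based encoder-decoder
# that transcribes a sequence of mouth-region frame features into characters,
# loosely in the spirit of "Watch, Attend and Spell". All names and sizes are
# hypothetical; the audio branch of the full WLAS model is left out.
import torch
import torch.nn as nn

class LipReader(nn.Module):
    def __init__(self, frame_feat_dim=512, hidden=256, vocab_size=40):
        super().__init__()
        # "Watch": encode per-frame visual features (e.g. from a CNN) over time.
        self.encoder = nn.LSTM(frame_feat_dim, hidden, batch_first=True)
        # "Spell": decode the transcript one character at a time.
        self.embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.LSTMCell(hidden * 2, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, frame_feats, target_chars):
        # frame_feats: (batch, time, frame_feat_dim); target_chars: (batch, length)
        enc_out, _ = self.encoder(frame_feats)             # (B, T, H)
        h = enc_out.new_zeros(frame_feats.size(0), enc_out.size(-1))
        c = torch.zeros_like(h)
        logits = []
        for t in range(target_chars.size(1)):
            emb = self.embed(target_chars[:, t])           # (B, H)
            # "Attend": dot-product attention over encoder time steps.
            scores = torch.bmm(enc_out, h.unsqueeze(-1)).squeeze(-1)      # (B, T)
            context = torch.bmm(torch.softmax(scores, dim=-1).unsqueeze(1),
                                enc_out).squeeze(1)        # (B, H)
            h, c = self.decoder(torch.cat([emb, context], dim=-1), (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)                  # (B, length, vocab)
```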
Explaining the research, the DeepMind team said that the aim of the study “is to recognise phrases and sentences being spoken by a talking face, with or without the audio.
“Unlike previous works that have focussed on recognising a limited number of words or phrases, we tackle lip reading as an open-world problem - unconstrained natural language sentences, and in the wild videos.”
The AI was trained on shows that aired between January 2010 and December 2015, and its performance was then tested on programmes broadcast between March and September 2016.
The performance of the system was compared against that of humans, with a professional lip-reading company asked to decipher a random sample of 200 videos.
The professional lip reader was able to decipher less than a quarter of the spoken words, whereas the WLAS model deciphered half of them.
In an interview with New Scientist, Ziheng Zhou at the University of Oulu, Finland said: “It’s a big step for developing fully automatic lip-reading systems. Without the huge data set, it’s very difficult for us to verify new technologies like deep learning.”
DeepMind researchers believe that the program could have a host of applications, such as helping hearing-impaired people to understand conversations.
It could also be used to annotate silent films and to assist with the control of digital assistants like Siri or Amazon’s Alexa.