June 27, 2022 (updated 29 June 2022, 3:17pm)

Sound of the metaverse: Meta creates AI models to improve virtual audio

New AI models created by Meta can improve the quality of virtual reality audio, the company claims.

By Ryan Morrison

Zoom calls, meetings in the metaverse and virtual events could all be improved in the future thanks to a series of AI models developed by engineers at Meta, which the company says match sound to imagery, mimicking the way humans experience sound in the real world.

Meta’s new AI model can match the sound of an audio stream with the image of a room. (Image by LeoPatrizi / iStock)

The three models, developed in partnership with researchers from the University of Texas at Austin, are known as Visual-Acoustic Matching, Visually-Informed Dereverberation and VisualVoice. Meta has made the models available for developers.

“We need AI models that understand a person’s physical surroundings based on both how they look and how things sound,” the company said in a blog post explaining the new models.

“For example, there’s a big difference between how a concert would sound in a large venue versus in your living room. That’s because the geometry of a physical space, the materials and surfaces in the area, and the proximity of where the sounds are coming from all factor into how we hear audio.”

Meta’s new audio AI models

The Visual-Acoustic Matching model can take an audio clip recorded anywhere, along with an image of a room or other space, and transform the clip so that it sounds as if it were recorded in that room.

An example use case could be making everyone in a video chat sound as though they are in the same space. If one participant is at home, another in a coffee shop and a third in an office, each voice could be adapted so that what you hear sounds as if it were coming from the room you are sitting in.
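This kind of acoustic matching can be thought of as estimating the target room’s impulse response and applying it to the source audio. The sketch below illustrates only that second step; `predict_rir_from_image` is a hypothetical stand-in for a learned image-to-acoustics model, not Meta’s released code.

```python
import numpy as np
from scipy.signal import fftconvolve

def match_room_acoustics(dry_audio: np.ndarray,
                         room_impulse_response: np.ndarray) -> np.ndarray:
    """Make `dry_audio` sound as if it were recorded in the target room by
    convolving it with that room's impulse response."""
    wet = fftconvolve(dry_audio, room_impulse_response, mode="full")
    peak = np.max(np.abs(wet))
    return wet / peak if peak > 0 else wet  # normalise to avoid clipping

# Hypothetical usage: `predict_rir_from_image` stands in for a learned model
# that estimates a room impulse response from a photo of the target space.
# rir = predict_rir_from_image(room_photo)
# matched_voice = match_room_acoustics(voice_clip, rir)
```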

Visually-Informed Dereverberation is a model that does the opposite: it takes the sound and visual cues from a space and focuses on removing reverberation from the recording. For example, it can focus on the music from a violin even if it is recorded inside a large train station.
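Dereverberation is often framed as time-frequency masking: convert the recording to a spectrogram, suppress the bins dominated by reflected energy, and resynthesise the waveform. The sketch below shows that general framing only; `predict_mask` is a hypothetical learned component, not Meta’s model.

```python
import numpy as np
from scipy.signal import stft, istft

def dereverberate(reverberant_audio: np.ndarray, sample_rate: int,
                  predict_mask) -> np.ndarray:
    """Suppress time-frequency bins dominated by reflected (reverberant) energy,
    using a mask predicted by a learned model, then resynthesise the waveform."""
    _, _, spec = stft(reverberant_audio, fs=sample_rate, nperseg=1024)
    # `predict_mask` is a hypothetical model; a visually informed version would
    # also look at an image of the space to judge how reverberant it is.
    mask = predict_mask(np.abs(spec))  # values in [0, 1], same shape as `spec`
    _, dry_audio = istft(spec * mask, fs=sample_rate, nperseg=1024)
    return dry_audio
```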


Finally, the VisualVoice model uses visual and audio cues to split speech from other background sounds and voices, allowing the listener to focus on a specific conversation. This could be used in a large conference hall with lots of people mingling.

This focused audio technique could also be used to generate better-quality subtitles, or to make it easier for future machine learning systems to understand speech when more than one person is talking, Meta explained.
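Audio-visual speech separation of this kind is commonly framed as predicting one time-frequency mask per visible speaker, conditioned on their face or lip movements. The following sketch illustrates that framing under stated assumptions; `predict_speaker_masks` and `face_embeddings` are hypothetical placeholders rather than the published VisualVoice interface.

```python
import numpy as np
from scipy.signal import stft, istft

def separate_speakers(mixture: np.ndarray, sample_rate: int,
                      face_embeddings, predict_speaker_masks) -> list:
    """Split a multi-speaker recording into one track per visible speaker by
    applying a separate time-frequency mask for each face the model is shown."""
    _, _, spec = stft(mixture, fs=sample_rate, nperseg=1024)
    # `predict_speaker_masks` is a hypothetical audio-visual model returning one
    # mask per face embedding; it is not the published VisualVoice interface.
    masks = predict_speaker_masks(np.abs(spec), face_embeddings)
    tracks = []
    for mask in masks:
        _, voice = istft(spec * mask, fs=sample_rate, nperseg=1024)
        tracks.append(voice)
    return tracks
```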

How AI can improve audio in virtual experiences

Rob Godman, reader in music at the University of Hertfordshire and an expert in acoustic spaces, told Tech Monitor that this work taps into a fundamental human need to understand where we are in the world, and brings that sense of place to virtual settings.

“We have to think about how humans perceive sound in their environment,” Godman says. “Human beings want to know where sound is coming from, how big a space is and how small a space is. When listening to sound being created we listen to several different things. One is the source, but you also listen to what happens to sound when combined with the room – the acoustics.”

Being able to capture and mimic that second aspect correctly could make virtual worlds and spaces seem more realistic, he explains, and do away with the disconnect humans might experience if the visuals don’t accurately match the audio.

An example of this could be a concert where a choir is performing outdoors, but the actual audio is recorded inside a cathedral, complete with significant reverb. That reverb wouldn't be expected on a beach, so the mismatch of sound and visuals would be unexpected and off-putting.

Godman said the biggest change is how the perception of the listener is considered when implementing these AI models. “The position of the listener needs to be thought out a great deal,” he says. “The sound made close to a person compared to metres away is important. It is based around the speed of sound in air so a small delay in the time it takes to get to a person is utterly crucial.”
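To put rough numbers on that point: sound travels at about 343 metres per second in air at room temperature, so the delay a listener perceives scales directly with distance. A quick back-of-the-envelope calculation (not from the article):

```python
# Rough propagation delays at the speed of sound in air (about 343 m/s at 20°C).
SPEED_OF_SOUND_M_PER_S = 343.0

for distance_m in (0.5, 2.0, 10.0, 30.0):
    delay_ms = distance_m / SPEED_OF_SOUND_M_PER_S * 1000.0
    print(f"{distance_m:5.1f} m  ->  {delay_ms:6.1f} ms")

# Output (approximate):
#   0.5 m  ->   1.5 ms   (someone standing next to you)
#   2.0 m  ->   5.8 ms
#  10.0 m  ->  29.2 ms
#  30.0 m  ->  87.5 ms   (across a large hall: clearly audible as a delay)
```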

He said part of the problem with improving audio is the lack of adequate end-user equipment, explaining users will "spend thousands of pounds on a curved monitor but won't pay more than £20 for a pair of headphones".

Professor Mark Plumbley, EPSRC Fellow in AI for Sound at the University of Surrey, is developing classifiers for different types of sounds so they can be removed or highlighted in recordings. “If you are going to create this realistic experience for people you need the vision and sound to match,” he says.

“It is harder for a computer than I think it would be for people. When we are listening to sounds there is an effect called directional masking that helps us focus on the sound from somebody in front of us and ignore sounds from the side.”

This is something we’re used to doing in the real world, Plumbley says. “If you are in a cocktail party, with lots of conversations going on, you can focus on the conversation of interest, we can block out sounds from the side or elsewhere,” he says. “This is a challenging thing to do in a virtual world.”
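Machines can approximate that directional focus with microphone arrays. One classical, non-learned technique is delay-and-sum beamforming, sketched below under a far-field assumption; it illustrates spatial selectivity in general and is not a description of Meta’s models.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # metres per second in air

def delay_and_sum(mic_signals: np.ndarray, mic_positions_m: np.ndarray,
                  look_direction: np.ndarray, sample_rate: int) -> np.ndarray:
    """Delay-and-sum beamforming: time-align each microphone for a chosen look
    direction and average, so sound arriving from that direction adds up
    coherently while sound from the sides partially cancels."""
    look = look_direction / np.linalg.norm(look_direction)
    # Per-microphone arrival-time advance for a distant (far-field) source.
    delays_samples = mic_positions_m @ look / SPEED_OF_SOUND * sample_rate
    delays_samples -= delays_samples.min()
    aligned = [np.roll(sig, int(round(d)))  # wrap-around ignored: fine for a sketch
               for sig, d in zip(mic_signals, delays_samples)]
    return np.mean(aligned, axis=0)
```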

He says a lot of this work has come about because of changes in machine learning, with better deep learning techniques that work across different disciplines, including sound and image AI. “A lot of these things are related to signal processing,” Plumbley adds.

“Whether sounds, gravitational waves or time series information from financial data. They are about signals that come over time. In the past researchers had to build individual ways for different types of objects to extract out different things. Now we are finding deep learning models are able to pull out the patterns.”
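One way to see that shared ground: a small pattern-extracting network has nothing audio-specific in it and can be pointed at any evenly sampled time series. The snippet below is a generic illustration, not tied to any of the systems mentioned in the article.

```python
import torch
from torch import nn

# A tiny 1-D convolutional encoder. Nothing in it is audio-specific, which is the
# point: the same pattern-extracting machinery can be aimed at any evenly sampled
# signal that arrives over time (sound, sensor readings, financial series).
encoder = nn.Sequential(
    nn.Conv1d(in_channels=1, out_channels=16, kernel_size=9, stride=4),
    nn.ReLU(),
    nn.Conv1d(16, 32, kernel_size=9, stride=4),
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),  # summarise the whole signal as one feature vector
)

waveform = torch.randn(1, 1, 16_000)     # one second of 16kHz audio: (batch, channel, time)
other_series = torch.randn(1, 1, 1_000)  # a generic 1,000-step time series
audio_features = encoder(waveform)       # shape (1, 32, 1)
series_features = encoder(other_series)  # same encoder, different kind of signal
```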

Read more: Google’s LaMBDA AI is not sentient but could pose a security risk
