VALL-E: New Microsoft AI can clone your voice in three seconds

Microsoft’s latest foray into the world of artificial intelligence comes in the form of VALL-E, a transformer-based text-to-speech model that can “recreate any voice from a three-second sample clip”. Cybersecurity experts say that without proper protections, it could be used for more realistic phishing attacks and to spread misinformation.

The VALL-E model was trained on 60,000 hours of speech and can generate a new voice from just a three-second sample clip. (Photo courtesy of Microsoft)

As well as reducing the training time to generate a new voice, VALL-E creates a much more natural-sounding synthetic voice than other models by preserving the intonation, charisma, and style of the original sample. These can then be directed as needed when writing the text-to-speech script.

Having these features means that with just three seconds of someone’s voice, recorded from a phone call, in person or even from a podcast, the model can synthesise that voice to say any sentence. It could potentially see words placed into the mouths of a politician, actor or even a family member “asking for money”.

Performance has improved over previous synthetic voice models to such a point that it would be difficult to tell whether you were hearing a real or fake voice, Microsoft says.

Much like other large generative AI models such as DALL-E 2 and GPT-3, VALL-E was built by feeding a significant amount of material into the system. Developers used 60,000 hours of English speech to train the model, drawn from Libri-Light, a corpus of public-domain audiobook recordings.

VALL-E could be used in gaming – and fintech

The code for VALL-E is not currently available to the public; only sample audio files produced using the tool have been published. It also isn’t clear when, or if, Microsoft plans to make VALL-E available publicly or as a commercial tool.

Joshua Kaiser, CEO of AI company Tovie.ai, told Tech Monitor that the model has been designed to let users do a lot more with a lot less data, which is crucial for organisations that want to build speech synthesis but lack enough data to achieve good performance. “We think this will benefit a lot of industries – from retail to fintech to gaming – that are already embracing voice interfaces, by making the whole process more accessible,” he says.

The biggest benefit from VALL-E is its potential scale, says Arun Chandrasekaran, distinguished VP analyst at Gartner. It can be effective in “zero-shot” or “few-shot” scenarios where little domain-specific training data is available. “In addition, if these models can be delivered as a cloud service, they can reduce time/effort required to get the models up and running in contrast to classic approaches,” Chandrasekaran says.

There are several real-world use cases for the technology, Chandrasekaran explains, including “speech editing (where a certain word or sentence can be corrected), contextualizing voice for different scenarios, interactive virtual learning, and customer service automation.”

The technology does come with risks, including spoofing voice identification or impersonating specific speakers and celebrities, which could lead to more rapid spread of misinformation. A public release would make it easier to carry out phishing attacks using a real voice, or to spread fake news online, perhaps through a YouTube video or a podcast. This could be why Microsoft has been slow to publish the code behind the technology or release an API, as OpenAI and others have done with text and image generation tools like GPT-3 and DALL-E 2.

Spoofing risk of VALL-E

Spoofing could allow a cybercriminal to gain access to banks or secure systems that use a voice print as a password, although many of these systems have mechanisms to detect whether a voice is live or recorded. In a phishing scam, an attacker could take a short sample of a voice from a phone call, then use that sample to create a new voice model, making it easier to convince someone to part with a password, perhaps by spoofing a finance manager at a company.

Muhammad Yahya Patel, security engineer at Check Point Software, said advances in new technology like VALL-E shouldn’t be feared, but that systems like this should still be approached with a degree of caution. “While it has its merits, Microsoft’s new VALL-E text-to-speech model could have some worrying implications for cybersecurity as it becomes more mature and integrated into our daily lives.

“If we have learned anything from the last year, it’s that cybercriminals will exploit any route to trick unsuspecting victims into handing over their passwords or bank details for example. Vishing [a scam phone call] is a popular method used by threat actors, and for good reason given the success rates of these campaigns.”

He said the new technology could give cybercriminals an opportunity to up their game and introduce a personal element, including allowing them to impersonate the voice of a loved one. “This would make it much harder for anyone to differentiate between the request of someone they trust and one from a malicious cybercriminal.

“Equally, as we move towards a time where many banks are now using voice authentication to authorise transactions, it’s easy to see how a threat actor could target an individual and gain access to an account with very minimal effort. It’s key that these opportunities for hackers to leverage new technologies is understood and as such, that the necessary precautions are being taken before it’s too late.”

Tech Monitor has approached Microsoft for comment on how it plans to mitigate the potential misuse of VALL-E.