The scam is simple. First, target a company and map out its corporate hierarchy. Then, pick a manager with budgetary authority and, while posing as their company’s chief executive, email them to say you’ve agreed a large purchase order but you need the funds by end of play today. Before they have a chance to reply, call them up and repeat the message – this time, in the voice of their CEO. “I need the funds yesterday,” you tell them. “Can you make that happen?”
This scenario is known as a ‘vishing’ scam. Heavily reliant on behavioural manipulation – in this case, the target’s eagerness to please their superiors – its success also depends on a sophisticated understanding of the corporate hierarchy and a talent for impersonation. Pull it off and the prizes can stretch to the millions – up to $35 million, in fact, in the case of one attempted heist in the UAE back in January 2020. Even so, the need for a convincing mimic means that such scams are comparatively rare, constituting an estimated 1% of all phishing scams.
That could soon change, as technology is now capable of cloning the human voice. While the ability to synthesise speech from text has been with us for decades, it is only in the past few years that deep learning models have become capable of producing voices that sound recognisably human. These ‘audio deepfake’ impersonations have received less media attention than their video counterparts, but criminals have recognised their potential as a tool of deception.
The first documented case of an audio deepfake used in a vishing scam took place in March 2019, when the chief executive of a UK-based energy company was duped into wiring over £200,000 to what he thought was its parent firm. “He was very sure that it was the same voice, the same quality of speech,” explains Marie-Christine Kragh, the global head of fidelity at the company’s insurer, Euler Hermes. The victim, she adds, had immediately recognised his superior’s subtle German accent.
Unable to trace the money, Euler Hermes reimbursed their client. Kragh predicts that the pace of research and development in this area means that the insurance firm will do the same for more clients in years to come. “I think there are definitely more cases that are not being reported,” she says. “We have seen an increase in the scheme itself, and we have seen it developing massively. And quite honestly, I can’t see a reason why they would stop now.”
Recipes for an audio deepfake
There are several ingredients needed to make an audio deepfake, explains Carnegie research fellow and AI expert Jon Bateman. First, you need the right text-to-speech algorithm. Then, you need sufficient computing power and storage, and enough samples of the target voice for the algorithm to imitate. All three have become easier to source in the past three years.
“Better algorithms are available, and then with it, more user-friendly software,” says Bateman. “Less computing power is generally needed because the algorithms are more efficient. And then, less training data is required.”
Just how many audio samples are required varies from model to model. In July 2019, Israeli cybersecurity researchers warned of one program that required only around 20 minutes of recordings to create a sufficiently convincing voice clone. Descript, which acquired the voice-cloning startup Lyrebird AI, offers a similar service that claims to achieve the same feat in a couple of minutes. Other open-source models have pared down the training time to seconds. These faster examples, says Bateman, are less convincing. “The result is very artificial sounding,” he says. “It just sounds strange. But if you listen close enough, you can feel like you’re hearing an echo of that voice.”
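None of the commercial services above publishes its internals, but the open-source route Bateman alludes to is easy to picture. The sketch below, a minimal and purely illustrative example, uses the freely available Coqui TTS toolkit and its pretrained ‘YourTTS’ model to clone a voice from a few seconds of reference audio; the model name and file paths are illustrative, and the toolkit is not one named by the researchers cited in this piece.

```python
# A minimal sketch of zero-shot voice cloning with the open-source Coqui TTS
# toolkit. The model name and file paths are illustrative; the reference clip
# should be your own (consenting) voice if you want to test how little audio
# such models need.
from TTS.api import TTS

# Download and load a pretrained multilingual voice-cloning model.
tts = TTS(model_name="tts_models/multilingual/multi-dataset/your_tts")

# Synthesise new speech in the style of the reference speaker.
tts.tts_to_file(
    text="This is a test of a cloned voice.",
    speaker_wav="my_own_voice_sample.wav",  # a few seconds of reference audio
    language="en",
    file_path="cloned_output.wav",
)
```

The point is not the specific toolkit but the barrier to entry: a pretrained model, a consumer laptop and a short audio sample are enough to produce the ‘echo of a voice’ Bateman describes.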
On their own, these models hardly seem convincing enough to fool someone into handing over thousands of pounds. But the shortcomings of an audio deepfake can be masked by sophisticated behavioural manipulation.
“You don’t have to get to a perfect level of verisimilitude,” says Bateman. “A lot of scams try to exploit time pressures, or emotional manipulation, or other contextual evidence that you are who you say you are. The goal is to get the victim to not pay attention to, or not give as much credence to, gaps or flaws in the impersonation.”
Gary Schildhorn certainly wasn’t listening out for his son Brett’s cadence when ‘Brett’ called him on his way to work last summer. Weeping, the caller explained that he had hit a pregnant woman in a car accident. He was now in custody, he said, and a public defender named Barry Goldstein would ring shortly about the bail bond. Within a couple of minutes, the lawyer called Schildhorn with the figure: $15,000, to be paid within the next couple of hours.
“They hit every button I have,” recalls Schildhorn. “I’m a lawyer. I fix things, right? I’m a father to my child, and people are hurt. They need me. I have to jump in and fix it, solve the problem.”
What the scammer didn’t realise was that, in the time between the call from ‘Brett’ and the one from the public defender, Schildhorn had called his daughter-in-law and his son’s place of work to tell them about the accident. Word eventually reached the real Brett Schildhorn, who rang his father to explain that the whole incident had been a scam. Gary felt relieved – and utterly drained. “When I sat in the car after I knew it was a scam, it almost made me physically exhausted,” he says.
Catching audio deepfake scams
Schildhorn doesn’t know for certain if the fraudulent call was from an audio deepfake or a skilled impersonator. He’s since heard from victims of identical scams, suggesting the availability of a tool that can be used at scale. “They’re using a lawyer with a Jewish name… they’re using your son’s hurt,” Schildhorn says. “They haven’t varied it much at all.”
Still, audio deepfakes are extremely difficult to catch red-handed. Victims rarely record their calls and the technology to detect synthetic speech in real-time is in its infancy. Usually, investigators only have contextual clues to go on: suspicious emails, a faultless imitation and the absence of pauses that would suggest an impressionist receiving instructions from a co-conspirator.
As a result, the number of cases proven beyond reasonable doubt remains low. In 2019, antivirus provider Symantec announced three instances in which audio deepfakes had been used against its clients. And while Kragh suspects the technology has probably been used in scams against other Euler Hermes clients, the insurer has only been able to prove it in one case.
Only one audio deepfake used for criminal purposes has been captured in the wild. In 2020, cybersecurity company NISOS was sent a recording of a suspicious voicemail by one of its clients. The firm immediately ran it through an audio spectrogram analysis tool and compared it to a real human voice speaking the same dialogue.
“You can tell immediately that it’s different,” explains Rob Volkert, managing principal at NISOS. The most obvious distinction was the speed of the recording. Human speech usually follows a specific cadence speckled with pauses. The voice on the recording, meanwhile, pitched up and down in unusual ways. “There was also a complete absence of background noise, too, which is not normal,” says Volkert.
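As an illustration of the kind of comparison NISOS describes – and emphatically not the firm’s own tooling – the sketch below uses the open-source librosa library to plot the spectrograms of a suspicious clip and a genuine recording of the same speaker, and to count the speech segments in each. The file names are placeholders.

```python
# A rough sketch of a spectrogram comparison between a suspicious recording
# and a genuine one. File names are placeholders; requires librosa, numpy
# and matplotlib.
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

def analyse(path, title, ax):
    y, sr = librosa.load(path, sr=None)
    # Short-time Fourier transform -> log-amplitude spectrogram.
    spec = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
    librosa.display.specshow(spec, sr=sr, x_axis="time", y_axis="hz", ax=ax)
    ax.set_title(title)
    # Natural speech is broken up by pauses, and the gaps between speech
    # segments usually contain at least some background noise.
    segments = librosa.effects.split(y, top_db=30)
    print(f"{title}: {len(segments)} speech segments over {len(y) / sr:.1f}s")

fig, axes = plt.subplots(2, 1, figsize=(10, 6), sharex=True)
analyse("suspicious_voicemail.wav", "Suspicious recording", axes[0])
analyse("genuine_speaker.wav", "Genuine recording of the same speaker", axes[1])
plt.tight_layout()
plt.show()
```

An unusually ‘clean’ spectrogram with no background noise floor, or a clip with far fewer pauses than the genuine sample, would flag the recording for closer inspection – exactly the tells Volkert points to.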
An example of a failed attempt at obtaining a fraudulent wire transfer using an audio deepfake. (Recording courtesy of NISOS)
The deepfake NISOS eventually released – listen to a section of the recording above – was a poor demonstration of the technology’s potential. The generated speech sounded robotic, and the fact that it had been left on a voicemail hardly suggests a genius fraudster (“The criminal is probably not going to try and leave a trail,” says Volkert). Its existence, however, points to a growing black-market demand for audio deepfakes – albeit one where supply remains limited.
“The trend we’ve seen hasn’t been so much, ‘Hey, I’m advertising the ability to create a video or an audio clip’,” says Volkert. “The criminals probably use open-source software, and then tweak it to their needs and then go out and deploy it.”
Defensive measures
What, then, can companies do to protect against the threat? From a technological standpoint at least, the answer is currently very little. Instead, explains Volkert, businesses should focus on educating their workforce on how they can spot the use of an audio deepfake – a task that’s simpler than it sounds. “I would advise companies to look at this as just an evolution of a phishing scam, at least right now,” he says.
Defeating these can be as simple as training a workforce to use challenge questions whenever they are asked to transfer money by email or over the phone, and to follow up with the IT department if a communication looks suspicious. Businesses can also implement other means of internal verification – biometric scanners, for instance – to ensure that capital flows can never be authorised through verbal communication alone.
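As a purely hypothetical sketch of that last point – the names, threshold and fields below are invented for illustration, not drawn from any vendor’s product – the rule can be reduced to a simple gate: no request moves money until it has been confirmed over a second, independently initiated channel.

```python
# Hypothetical sketch of a 'never on verbal instruction alone' payment gate.
# All names and the threshold are invented for illustration.
from dataclasses import dataclass
from typing import Optional

@dataclass
class PaymentRequest:
    amount: float
    requested_by: str                      # e.g. 'CEO', as claimed on the call
    channel: str                           # 'phone', 'video', 'email', ...
    callback_confirmed: bool = False       # re-contacted via a number on file?
    second_approver: Optional[str] = None  # independent sign-off, if any

def may_release(req: PaymentRequest, threshold: float = 10_000) -> bool:
    """Release funds only if the request has been verified out of band."""
    if not req.callback_confirmed:
        return False  # a voice, video or email alone never moves money
    if req.amount >= threshold and req.second_approver is None:
        return False  # large transfers need a second pair of eyes
    return True

# An urgent 'CEO' phone call that has not been called back on a known number:
urgent = PaymentRequest(amount=200_000, requested_by="CEO", channel="phone")
print(may_release(urgent))  # False
```

However the control is implemented – workflow software, biometrics or a paper form – the principle is the same: the voice on the line is never, by itself, sufficient authorisation.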
Defensive measures should not, however, ignore other, less obvious forms of deepfake fraud, says Bateman. As the technology matures, large companies will become vulnerable to disinformation tactics. A counterfeit recording of their chief executive making damaging remarks, subsequently ‘leaked’, could help tank the share price – and fill the pockets of rogue short-sellers into the bargain.
This kind of attack is very difficult to debunk, “especially if the audio is purportedly some kind of private conversation for the CEO,” explains Bateman. “You’d have to rely on your reputation, in all likelihood. CEOs with strong reputations would be able to fend off such an attack, but for a CEO with a more mixed reputation or someone who is already in some kind of PR crisis, this can be damaging.”
For the moment, such a scandal has yet to materialise. Audio deepfakes remain largely experimental, with criminals still appearing to dabble with open-source versions of the software. But the potential gains could be huge. “From 2016 to 2019, we already had €165m of losses caused by 65 cases” of vishing scams, says Kragh. Individual cases range from thefts of €150,000 all the way up to €50m.
The widespread move to remote working has likely made the situation worse. “You could say that that’s the perfect breeding ground for social engineering, because there’s a lot of uncertainty there,” says Kragh. Contact between colleagues is now made almost exclusively through email and video calls, for one. The ubiquity of virtual ‘town hall’ meetings has also made it easier for the industrious hacker to obtain training data for deepfake models.
The best defence against such efforts, explains Kragh, is to train employees to think critically about what is being asked of them during calls from superiors. That means, she says, “making sure that, whatever you have right in front of you, you understand what’s going on and you don’t shy away from asking those challenging questions.”
In the meantime, companies should remember that the threat of audio deepfakes remains on the horizon, not at the door. As far as the criminal threat to businesses is concerned, the danger more often lies far closer to home – in many cases, with their own staff. “The vast majority of claims are caused by these insiders,” explains Kragh, the very people who know the company best. “That’s something leaders need to be very aware of. It doesn’t mean don’t trust your employees – it’s very important to do so. But it’s important to not close your eyes, and stay vigilant.”