Ian Bigford

The unique risks of audio deepfakes

3/25/2025 · 7 min read


Voice cloning from a few seconds of audio is basically a solved problem now. ElevenLabs and its competitors keep getting better, adoption keeps growing, and the uncomfortable truth is that our ears just aren't equipped to tell the difference, especially in real-world conditions like a noisy phone call or a conversation where you're emotionally off-balance. This concerns me more than image or video deepfakes because the attack surface is so much more personal.

We're terrible at spotting fake voices

The research on this is pretty damning.

A UCL study found that participants could correctly identify deepfake speech only 73% of the time across English and Mandarin, even after training. A 2024 study in Scientific Reports found something arguably worse. While listeners could match AI voices to real identities at about 80% accuracy, they correctly flagged an AI voice as fake only 60% of the time. People are biased toward believing a fake voice is real.

[Figure: The Auditory Blindspot. Human auditory processing actively evaluates prosody, conversational flow, emotional tone, and speaker cadence, while completely overlooking the low-level acoustic artifacts (vocoder fingerprints, phase coherence, missing breath spectra, and harmonic regularization) where deepfakes hide.]

This makes sense when you think about what we actually listen for. Humans rely on heuristics like prosody and conversational flow, which are exactly the things modern TTS models are trained to nail. Add phone line compression, background noise, or emotional distress to the mix and you've created perfect conditions for audio social engineering.

Automated detection works great until it doesn't

Machine learning detectors dramatically outperform humans. Systems like DeepSonar reported ~98% accuracy across various datasets and noise conditions. Other real-time detectors have hit over 99% frame-level accuracy on in-distribution data. The ASVspoof challenges have shown steady improvement in classifier performance across rounds.

[Figure: What the ML Detector Sees. Side-by-side log-Mel spectrograms of a human voice (irregular formant transitions, breath artifacts in the 100-300 Hz band, micro-tremor in harmonics) versus an AI-synthesized voice (unnaturally uniform pitch period, silent breath band, periodic harmonic grid), showing differences invisible to human ears but reliably detectable by trained CNNs.]

But the generalization problem is severe. A 2024 review in Cybersecurity (SpringerOpen) found that detector performance can degrade by an average of 48% when facing unseen conditions (see also the ASVspoof 2025 overview). That degradation shows up when the detector encounters audio from a generator it hasn't been trained on, different codecs, channels, or compression artifacts, or even light post-processing.

[Figure: The Generalization Cliff. ML detector accuracy falls from ~98% on lab benchmark data to ~50% on real-world in-the-wild audio, annotated with four causes of domain shift: new generators, different codecs, light post-processing, and phone compression.]

This is a classic domain shift problem. Lab performance doesn't guarantee real world robustness, and in the real world, attackers aren't going to use the generators you trained against.
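The failure mode is easy to reproduce in miniature. Here's a toy numpy sketch, not a real detector: the "features" are synthetic stand-ins for spectral statistics, and the two artifact offsets are hypothetical fingerprints for two generators. It shows how a classifier that keys on one generator's fingerprint collapses on another's:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_features(n, artifact_shift):
    """Toy spectral features: real audio clusters near zero, synthetic audio
    carries a generator-specific artifact offset (its 'fingerprint')."""
    real = rng.normal(0.0, 1.0, size=(n, 8))
    fake = rng.normal(artifact_shift, 1.0, size=(n, 8))
    return np.vstack([real, fake]), np.array([0] * n + [1] * n)  # 0=real, 1=fake

class CentroidDetector:
    """Minimal detector: classify by distance to per-class centroids."""
    def fit(self, X, y):
        self.c0, self.c1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
        return self
    def predict(self, X):
        d0 = np.linalg.norm(X - self.c0, axis=1)
        d1 = np.linalg.norm(X - self.c1, axis=1)
        return (d1 < d0).astype(int)

# Train against "generator A", which leaves a strong, consistent artifact.
det = CentroidDetector().fit(*make_features(500, artifact_shift=3.0))

# In-distribution test: more audio from the same generator family.
X_in, y_in = make_features(500, artifact_shift=3.0)
acc_in = (det.predict(X_in) == y_in).mean()

# Domain shift: unseen "generator B" leaves a different, weaker fingerprint.
X_out, y_out = make_features(500, artifact_shift=0.3)
acc_out = (det.predict(X_out) == y_out).mean()

print(f"in-distribution: {acc_in:.2f}, unseen generator: {acc_out:.2f}")
```

The detector is near-perfect on the generator it saw and barely beats a coin flip on the one it didn't, which is the same shape as the 48% degradation reported above.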

The threat is getting worse, fast

Several trends are compounding the problem simultaneously. Both open source and commercial voice synthesis models are getting better at capturing breath, micro-tremors, and other subtle markers of natural speech (VALL-E, NaturalSpeech 2). The commercial platforms are doing decent work on safety features, but open source availability means those controls only reach users who stay inside the commercial platforms.

Meanwhile, the barrier to entry keeps dropping. A few seconds of clean audio is enough for many cloning pipelines, and the compute is cheap (Real-Time Voice Cloning, OpenVoice). A TikTok video or a Facebook post is now sufficient source material, which represents a fundamentally new attack surface for ordinary people.

Perhaps most concerning, most people don't know this is possible. Social engineering attacks using voices that project authority, like a CEO, or distress, like a kidnapping victim, have demonstrably high success rates (FBI IC3 2023, FCC ruling on AI robocalls).

This isn't theoretical. It's already happening in the wild. In 2019, a voice clone of a CEO was used to trick an executive into wiring €220,000 (ICAEW). In 2024, a multi-person video and voice deepfake scheme in Hong Kong led to a $25.6 million transfer (Guardian). Voice clones mimicking President Biden were used in robocalls to New Hampshire voters ahead of the 2024 primary (NPR). Criminals are using cloned voices of loved ones to fake kidnappings and demand immediate payment (FTC consumer alert).

Defense needs layers

No single solution fixes this. Effective defense requires multiple complementary approaches working together.

The first layer is provenance and labeling. Standards like C2PA's Content Credentials create an auditable trail for digital media, tracking where it came from and what was modified. This is important foundational work, but it has two significant limitations. It only works if everyone adopts it across capture devices, software, and platforms, and provenance data can be stripped during transcoding, compression, or re-recording. Anyone who knows C2PA exists can trivially bypass it.
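The fragility is easy to illustrate. The toy sketch below uses an HMAC with a shared key, which is not the actual C2PA format (C2PA uses X.509 certificate chains and embedded manifests), but it shows the core binding problem: a signature tied to exact media bytes stops verifying after any transcoding or re-recording, even a benign one:

```python
import hashlib
import hmac

KEY = b"publisher-signing-key"  # hypothetical key for illustration only

def sign_manifest(media_bytes: bytes) -> bytes:
    """Bind a provenance claim to the exact bytes of the media file."""
    digest = hashlib.sha256(media_bytes).digest()
    return hmac.new(KEY, digest, hashlib.sha256).digest()

def verify(media_bytes: bytes, signature: bytes) -> bool:
    """Recompute the binding and compare in constant time."""
    return hmac.compare_digest(sign_manifest(media_bytes), signature)

original = b"RIFF-fake-audio-bytes"  # stand-in for an audio file
sig = sign_manifest(original)

print(verify(original, sig))    # True: untouched media verifies
transcoded = original + b"\x00"  # any re-encoding changes the bytes
print(verify(transcoded, sig))  # False: provenance no longer checks out
```

That second result is the benign case; the malicious case is simpler still, since an attacker just strips the manifest entirely.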

The second layer is better automated detection. To beat the generalization problem, detectors need to be trained for field robustness rather than benchmark performance. That means diversifying training data across generators, languages, codecs, and noise profiles. It also means going beyond spectrogram textures to look at temporal dynamics, phase artifacts, and vocoder fingerprints, and building closed-loop pipelines that feed hard-to-detect real-world examples back into training. I've spent a lot of time on this specific problem (see my post on running 50 experiments with deepfake detectors), and the generalization issue is genuinely hard. Most detectors learn the fingerprints of the generators they've seen and fall apart on everything else.
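The closed-loop idea can be sketched concretely. Everything in this toy is a stand-in: a 1-nearest-neighbor rule in place of a real CNN, synthetic features in place of spectrograms, and hypothetical artifact offsets for the seen and unseen generators. The point is the mechanism: a detector that misses an unseen generator starts catching it once labeled field samples are folded back into training:

```python
import numpy as np

class NNDetector:
    """Toy 1-nearest-neighbor detector (0 = real, 1 = fake)."""
    def fit(self, X, y):
        self.X, self.y = X, y
        return self
    def predict(self, Q):
        # Label each query with the label of its nearest training sample.
        d = ((Q[:, None, :] - self.X[None, :, :]) ** 2).sum(axis=-1)
        return self.y[d.argmin(axis=1)]

rng = np.random.default_rng(7)

# Lab training set: reals near zero, fakes from a seen generator (offset +3).
X = np.vstack([rng.normal(0.0, 1.0, (200, 8)), rng.normal(3.0, 1.0, (200, 8))])
y = np.array([0] * 200 + [1] * 200)
det = NNDetector().fit(X, y)

# In-the-wild fakes from an unseen generator (offset -1.5), split in half:
# one half gets labeled by analysts and fed back, the other is held out.
wild_fakes = rng.normal(-1.5, 1.0, (200, 8))
feedback, held_out = wild_fakes[:100], wild_fakes[100:]

before = det.predict(held_out).mean()  # fraction correctly flagged as fake

# Closed-loop step: add the hard, labeled field samples and refit.
X = np.vstack([X, feedback])
y = np.concatenate([y, np.ones(100, dtype=int)])
det.fit(X, y)

after = det.predict(held_out).mean()
print(f"detection rate on unseen generator: before {before:.2f}, after {after:.2f}")
```

The evaluation uses held-out wild samples rather than the ones fed back, so the improvement reflects generalization to the new generator, not memorization of specific clips.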

The third layer is user-side voice verification, and this is the one I find most interesting as a future defense. The idea is essentially a voice-of-me system. You record your voice locally, and your device extracts a non-invertible biometric signature that can verify your voice but can't be used to synthesize it. You share the signature with trusted contacts via encrypted exchange. When someone calls claiming to be you, the recipient's device checks the incoming audio against the stored signature. Throw in a liveness challenge like "repeat this random phrase" and you've got something that's genuinely hard for a real-time deepfake pipeline to beat, especially over a network with latency.

sequenceDiagram
  autonumber
  participant CD as Caller's Device
  participant CC as Secure Channel (E2E)
  participant RD as Recipient's Device
  participant LK as Local Key Store
 
  Note over CD: Enrollment — one-time setup
  CD->>CD: Record voice sample locally
  CD->>CD: Extract non-invertible biometric signature
  Note right of CD: Signature can verify identity<br/>but cannot synthesize the voice
  CD->>CC: Share signature (end-to-end encrypted)
  CC->>RD: Deliver to trusted contacts
  RD->>LK: Store trusted voice signature
 
  Note over CD,RD: Call verification — every call
  CD->>RD: Incoming call (claims to be owner)
  RD->>CD: Issue liveness challenge — "repeat phrase #4829"
  CD->>RD: Spoken response (streamed audio)
  RD->>LK: Fetch stored signature
  RD->>RD: Local biometric match + liveness check
 
  alt Signature matches AND liveness passes
    RD-->>RD: ✓ VERIFIED — identity confirmed
  else Mismatch or liveness failure
    RD-->>RD: ✗ ALERT — possible deepfake, abort
  end
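The local match step at the heart of the diagram might look like the sketch below. Everything here is an assumption for illustration: a real system would use a trained speaker-embedding network (an x-vector-style model) whose vectors tolerate noise and channel effects, whereas this toy embedding only matches identical audio, and the 0.85 threshold is a made-up operating point:

```python
import hashlib
import numpy as np

THRESHOLD = 0.85  # assumed operating point; real systems tune this on enrollment data

def embed(audio: bytes) -> np.ndarray:
    """Toy stand-in for a speaker-embedding model: maps audio to a unit
    vector. Deterministic on the bytes, so only identical audio matches."""
    seed = int.from_bytes(hashlib.sha256(audio).digest()[:4], "big")
    v = np.random.default_rng(seed).normal(size=64)
    return v / np.linalg.norm(v)

# Enrollment: the owner derives a signature and shares it with trusted contacts.
signature = embed(b"owner-voice-sample")

def verify_caller(incoming_audio: bytes) -> bool:
    """Local check on the recipient's device: cosine similarity of unit vectors."""
    return float(signature @ embed(incoming_audio)) >= THRESHOLD

print(verify_caller(b"owner-voice-sample"))    # True: identical embedding
print(verify_caller(b"cloned-voice-attempt"))  # False: independent vector, near-zero similarity
```

The important property is the direction of the check: the signature can confirm a match, but because the embedding is non-invertible, stealing it doesn't let an attacker synthesize the voice.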

This shifts the burden from unreliable human intuition under stress to an automated, device-level check. That's the right direction.

The uncomfortable gap between detection and threat

The situation is pretty uncomfortable when you lay it all out. Humans detect voice deepfakes at 60-73% in studies. Automated detectors hit 98%+ in the lab but struggle badly with unseen attacks. And the technology to create convincing clones is cheap, accessible, and improving faster than defenses can keep up.

Organizations should pair detection tools with strict out-of-band verification for anything high risk. If someone calls requesting a wire transfer, call them back on a known number. Platforms should adopt provenance standards like C2PA even though they're imperfect. For individuals, the most pragmatic defense right now is old-fashioned. Pre-arranged safe words with family members, callback protocols, and healthy skepticism when someone you love calls in distress. That last one is the hardest to maintain, which is exactly why attackers exploit it.
