Ian Bigford

The unique risks of audio deepfakes

3/25/2025 · 7 min read


Voice cloning from a few seconds of audio is basically a solved problem now. ElevenLabs and its competitors keep getting better, adoption keeps growing, and the uncomfortable truth is that human ears are not equipped to tell the difference—especially under real-world conditions like a noisy phone call or an emotionally charged conversation.

We're terrible at spotting fake voices

The research on this is pretty damning.

A UCL study found that participants could correctly identify deepfake speech only 73% of the time across English and Mandarin—even after training. A 2024 study in Scientific Reports found something arguably worse: while listeners could match AI voices to real identities at about 80% accuracy, they correctly flagged an AI voice as fake only 60% of the time. People are biased toward believing a fake voice is real.

Figure: The Auditory Blindspot. Human auditory processing actively evaluates prosody, conversational flow, emotional tone, and speaker cadence while completely overlooking the low-level acoustic artifacts (vocoder fingerprints, phase coherence, missing breath spectra, harmonic regularization) where deepfakes hide.

This makes sense when you think about what we actually listen for. Humans rely on heuristics like prosody and conversational flow—exactly the things modern TTS models are trained to nail. Add phone line compression, background noise, or emotional distress to the mix, and you've created perfect conditions for audio-based social engineering.

Automated detection works great—until it doesn't

Machine learning detectors dramatically outperform humans. Systems like DeepSonar reported ~98% accuracy across various datasets and noise conditions. Other real-time detectors have hit over 99% frame-level accuracy on in-distribution data. The ASVspoof challenges have shown steady improvement in classifier performance across rounds.

Figure: What the ML Detector Sees. Side-by-side log-Mel spectrograms of a human voice (irregular formant transitions, breath artifacts in the 100–300 Hz band, micro-tremor in harmonics) versus an AI-synthesized voice (unnaturally uniform pitch period, silent breath band, periodic harmonic grid): microscopic differences invisible to human ears but reliably detectable by trained CNNs.
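To make this concrete, here is a toy sketch in pure Python (synthetic signals, no audio libraries, and not any particular detector's actual feature set) of one low-level feature in this family, spectral flatness, which distinguishes noise-like frames from strongly tonal ones:

```python
import cmath
import math
import random

def dft_mag(frame):
    """Naive O(n^2) DFT magnitude spectrum; fine for a toy example."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) for k in range(n // 2)]

def spectral_flatness(mags, eps=1e-12):
    """Geometric mean over arithmetic mean of the magnitude spectrum:
    near 1.0 for noise-like frames, near 0.0 for strongly tonal frames."""
    mags = [m + eps for m in mags]
    gmean = math.exp(sum(math.log(m) for m in mags) / len(mags))
    return gmean / (sum(mags) / len(mags))

random.seed(0)
n = 256
tone = [math.sin(2 * math.pi * 8 * t / n) for t in range(n)]   # pure tone
noise = [random.uniform(-1.0, 1.0) for _ in range(n)]          # white-ish noise

sf_tone = spectral_flatness(dft_mag(tone))    # close to 0: strongly tonal
sf_noise = spectral_flatness(dft_mag(noise))  # well above 0: noise-like
```

A real detector feeds hundreds of statistics like this (or raw log-Mel frames) into a CNN; the point is only that synthesis artifacts live in measurable spectral regularities, not in anything a listener consciously hears.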

But here's the catch: a 2024 review in Cybersecurity (SpringerOpen) found that detector performance can degrade by an average of 48% when facing "unseen" conditions (see also the ASVspoof 2025 overview). That means:

  1. Audio from a generator the detector hasn't been trained on
  2. Different codecs, channels, or compression artifacts
  3. Even light post-processing

Figure: The Generalization Cliff. Detector accuracy at ~98% on lab benchmark data drops off a cliff to ~50% on real-world in-the-wild audio, annotated with the four causes of domain shift: new generators, different codecs, light post-processing, and phone compression.

This is a classic domain shift problem. Lab performance doesn't guarantee real-world robustness. And in the real world, attackers aren't going to use the generators you trained against.
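A deliberately artificial sketch of that failure mode: a detector whose decision threshold was fit to lab score distributions, scored against a shifted "in-the-wild" distribution. All numbers here are synthetic and exist only to show the shape of the problem.

```python
import random

random.seed(1)

def flags_fake(score, threshold=0.5):
    """Toy detector: the threshold was 'trained' against lab data where
    real clips score near 0.8 and fakes score near 0.2."""
    return score < threshold

# In-distribution test set: the same conditions the threshold was fit on
lab_real = [random.gauss(0.80, 0.05) for _ in range(1000)]
lab_fake = [random.gauss(0.20, 0.05) for _ in range(1000)]

# Domain shift: a new generator plus phone-codec compression pushes
# fake scores up into the region the detector considers "real"
wild_fake = [random.gauss(0.55, 0.05) for _ in range(1000)]

lab_acc = (sum(not flags_fake(s) for s in lab_real) +
           sum(flags_fake(s) for s in lab_fake)) / 2000   # near-perfect
wild_recall = sum(flags_fake(s) for s in wild_fake) / 1000  # collapses
```

The detector hasn't gotten worse; the score distribution it was calibrated against no longer describes the inputs it sees.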

The threat is getting worse, fast

Several trends are compounding the problem:

  • The models keep improving. Both open-source and commercial models are getting better at capturing breath, micro-tremors, and other subtle markers of natural speech (VALL-E, NaturalSpeech 2). The commercial platforms are doing decent work on safety features, but because comparable models are openly available, those safeguards only constrain people who use the commercial platforms.
  • You barely need any audio to clone someone. A few seconds of clean audio is enough for many pipelines, and the compute is cheap (Real-Time Voice Cloning, OpenVoice). A TikTok video or a Facebook post is now sufficient source material. That's a fundamentally new attack surface for ordinary people.
  • Most people don't know this is possible. That's why social engineering attacks using voices that project authority (a CEO) or distress (a kidnapping victim) succeed at demonstrably high rates (FBI IC3 2023, FCC ruling on AI robocalls).

And this isn't theoretical. It's already happening:

  • Financial fraud: In 2019, a voice clone of a CEO was used to trick an executive into wiring €220,000 (ICAEW). In 2024, a multi-person video and voice deepfake scheme in Hong Kong led to a $25.6 million transfer (Guardian).
  • Disinformation: Voice clones mimicking President Biden were used in robocalls to New Hampshire voters ahead of the 2024 primary (NPR).
  • Extortion: Criminals are using cloned voices of loved ones to fake kidnappings and demand immediate payment (FTC consumer alert).

Defense needs layers

No single solution is going to fix this. You need multiple overlapping defenses.

Provenance and labeling. Standards like C2PA's Content Credentials create an auditable trail for digital media—where it came from, what modifications were made. This is important foundational work, but it has two big limitations: it only works if everyone adopts it (capture devices, software, platforms), and provenance data can be stripped during transcoding, compression, or re-recording. An attacker who knows about C2PA can trivially bypass it.
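That second limitation is easy to see in miniature. C2PA binds a signed claim to a hash of the exact media bytes; the sketch below uses an HMAC as a stand-in for the standard's certificate-based signatures, and every name in it is illustrative, not C2PA's actual schema.

```python
import hashlib
import hmac

SIGNING_KEY = b"publisher-secret"  # stand-in for a real signing certificate

def make_manifest(media_bytes: bytes) -> bytes:
    """Bind a provenance claim to the exact media bytes via a keyed tag."""
    digest = hashlib.sha256(media_bytes).hexdigest()
    claim = f"recorded-by:alice;sha256:{digest}".encode()
    tag = hmac.new(SIGNING_KEY, claim, hashlib.sha256).hexdigest()
    return claim + b";tag:" + tag.encode()

def verify(media_bytes: bytes, manifest: bytes) -> bool:
    """Check the claim's tag, then check the bytes still match the claim."""
    claim, _, tag = manifest.rpartition(b";tag:")
    expected = hmac.new(SIGNING_KEY, claim, hashlib.sha256).hexdigest().encode()
    if not hmac.compare_digest(tag, expected):
        return False
    return hashlib.sha256(media_bytes).hexdigest().encode() in claim

original = b"\x00\x01raw-pcm-audio..."
manifest = make_manifest(original)
print(verify(original, manifest))    # True: bytes match the signed claim
reencoded = original + b"\x00"       # transcoding alters the bytes
print(verify(reencoded, manifest))   # False: provenance no longer binds
```

Any transcode, re-recording through speakers, or platform re-encode produces new bytes and a failed (or absent) verification, which an attacker can also achieve on purpose by simply stripping the manifest.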

Better automated detection. To beat the generalization problem, detectors need to be trained for field robustness, not benchmark performance. That means diversifying training data across generators, languages, codecs, and noise profiles. It means going beyond spectrogram textures to analyze temporal dynamics, phase artifacts, and vocoder fingerprints. And it means building closed-loop pipelines that feed hard-to-detect real-world examples back into training. The detector has to keep learning from its failures.

User-side voice verification. This is the one I find most interesting as a future defense layer. The idea is essentially a "voice-of-me" system: you record your voice locally on your device, which extracts a non-invertible biometric signature—one that can verify your voice but can't be used to synthesize it. You share this signature with trusted contacts via encrypted exchange. When someone calls claiming to be you, the recipient's device runs a local check against the stored signature. Add a liveness challenge—"repeat this random phrase"—and you've got something that's genuinely hard for a real-time deepfake pipeline to beat, especially over a network with latency.

sequenceDiagram
  autonumber
  participant CD as Caller's Device
  participant CC as Secure Channel (E2E)
  participant RD as Recipient's Device
  participant LK as Local Key Store
 
  Note over CD: Enrollment — one-time setup
  CD->>CD: Record voice sample locally
  CD->>CD: Extract non-invertible biometric signature
  Note right of CD: Signature can verify identity<br/>but cannot synthesize the voice
  CD->>CC: Share signature (end-to-end encrypted)
  CC->>RD: Deliver to trusted contacts
  RD->>LK: Store trusted voice signature
 
  Note over CD,RD: Call verification — every call
  CD->>RD: Incoming call (claims to be owner)
  RD->>CD: Issue liveness challenge — "repeat phrase #4829"
  CD->>RD: Spoken response (streamed audio)
  RD->>LK: Fetch stored signature
  RD->>RD: Local biometric match + liveness check
 
  alt Signature matches AND liveness passes
    RD-->>RD: ✓ VERIFIED — identity confirmed
  else Mismatch or liveness failure
    RD-->>RD: ✗ ALERT — possible deepfake, abort
  end

This shifts the burden from unreliable human intuition under stress to an automated, device-level check. That's the right direction.
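The verification half of that flow can be sketched in a few lines, assuming a speaker-embedding model exists. The embed function below is a deterministic stand-in for a neural encoder, and every name and threshold here is hypothetical.

```python
import hashlib
import math
import secrets

def embed(audio: bytes, dims: int = 16) -> list:
    """Hypothetical stand-in for a speaker-embedding model. A real system
    would run a neural encoder and store a non-invertible template; here we
    derive a deterministic zero-mean vector from the bytes so the flow runs."""
    h = hashlib.sha256(audio).digest()
    return [(h[i] - 127.5) / 127.5 for i in range(dims)]

def cosine(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def verify_call(stored, challenge: str, transcript: str, audio: bytes,
                threshold: float = 0.95) -> bool:
    """Liveness check (did they speak the random phrase?) plus a local
    biometric match against the stored voice signature."""
    if transcript != challenge:
        return False  # liveness failure: wrong or replayed phrase
    return cosine(stored, embed(audio)) >= threshold

# Enrollment (one-time): recipient stores the trusted signature locally
stored = embed(b"alice enrollment sample")

# Per-call: recipient issues a random phrase the caller must repeat
challenge = f"repeat phrase #{secrets.randbelow(10000)}"

ok = verify_call(stored, challenge, challenge, b"alice enrollment sample")
spoofed = verify_call(stored, challenge, challenge, b"cloned voice attempt")
```

One caveat the toy glosses over: the stored vector here is invertible in principle, so a production system would use a cancellable template or fuzzy-extractor scheme, keeping the property the diagram promises, that a leaked signature can verify a voice but not synthesize it.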

Where this leaves us

The bottom line is uncomfortable but clear. Humans can't reliably detect voice deepfakes (60-73% accuracy in studies). Automated detectors are great in the lab (98%+) but struggle badly with new, unseen attacks. And the technology to create convincing voice clones is cheap, accessible, and improving fast.

For organizations, the practical move is to pair detection tools with strict out-of-band verification for high-risk actions—if someone calls requesting a wire transfer, you call them back on a known number. For platforms, adopting provenance standards like C2PA is table stakes. For individuals, the most pragmatic defense right now is a "zero-trust" policy for voice: pre-arranged safe words, callback protocols, and healthy skepticism when someone you love calls in distress. That last one is the hardest to maintain, which is exactly why attackers exploit it.
