My 350M parameter deepfake detector scored 100% on tests and failed in the real world

Deepfake-Eval-2024 (Chandra et al.) collected 45 hours of video, 56.5 hours of audio, and nearly 2,000 images from social media and deepfake detection platforms — real fakes circulating in the wild across 88 websites in 52 languages. When they evaluated state-of-the-art open-source detectors on this data, performance collapsed: AUC dropped by 50% for video models, 48% for audio models, and 45% for image models compared to academic benchmarks. Models that reported near-perfect scores on curated datasets were effectively broken on the kind of deepfakes people actually encounter online.

The implication was clear. Academic benchmarks are outdated and unrepresentative. The detectors the research community has been building look impressive in the lab but fail where it matters. The gap between reported performance and real-world performance isn't a minor calibration issue — it's a chasm.

I wanted to test this empirically on the audio side. I ran 50 experiments across four architectures, multiple datasets, different audio codecs, and various training configurations. The critical evaluation benchmark was the Deepfake-Eval-2024 audio set — 1,973 files from deepfake generators that none of my models had ever seen during training. Exactly the kind of out-of-distribution test that separates detectors that have learned something real from detectors that have merely memorized their training set.

The results confirmed the Deepfake-Eval-2024 findings more dramatically than I expected. The model I expected to win lost badly, the model I almost didn't bother training won, and the gap between "test set accuracy" and "real-world performance" was wider than anything I'd anticipated.

Here's everything that happened.

The Setup

Architectures

I tested four architectures, ranging from a tiny CNN to a full self-supervised learning pipeline:

Mel-CNN (~400K parameters, 435 KB per checkpoint) A lightweight convolutional neural network operating directly on mel-spectrograms with delta (temporal derivative) features. Two input channels (mel + delta mel), a simple convolutional backbone, and an embedding dimension of 64. This was my baseline — the simplest thing that could possibly work.

Mel-AASIST (~2M parameters, 2.2 MB per checkpoint) An adapted version of AASIST (Audio Anti-Spoofing using Integrated Spectro-Temporal graph attention networks). My version uses three parallel multi-scale convolution branches (3x3, 5x5, and 7x7 dilated with dilation=3) that capture artifacts at different time-frequency resolutions. Each block includes squeeze-and-excitation (SE) attention for channel recalibration (reduction ratio 16) and attentive statistics pooling (ASP) for temporal aggregation — instead of simple average pooling, ASP learns to weight different time frames based on their relevance. Three multi-scale conv blocks stacked, with embedding dimension 64 and dropout 0.3.

WavLM + CNN (~350M+ parameters, ~1.3 GB per checkpoint) Microsoft's WavLM-Large — a self-supervised model pretrained on 94,000 hours of speech — frozen as a feature extractor, with my Mel-CNN as the classifier head. The idea: let the pretrained model provide rich audio representations, and train a lightweight classifier on top.

WavLM + AASIST (~350M+ parameters, ~1.3 GB per checkpoint) The full pipeline. Frozen WavLM-Large backbone with three feature streams fused together: WavLM embeddings projected to 256 dimensions, mel-spectrogram features through the AASIST encoder, and LFCC (Linear Frequency Cepstral Coefficients) as an alternative spectral representation. A fusion network with hidden dimension 256 combines all three streams. This was my "throw everything at the wall" architecture.
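To make the attentive statistics pooling step in Mel-AASIST concrete, here is a minimal pure-Python sketch. It is an illustration of the general technique, not the training code: the function name is hypothetical, and the attention scores are passed in directly, whereas in a real model they would come from a small learned layer.

```python
import math


def attentive_stats_pooling(frames, scores):
    """Pool T frame vectors into one [weighted mean, weighted std] vector.

    frames: list of T feature vectors (each a list of D floats).
    scores: list of T unnormalized attention scores. In a real model these
            are produced by a learned layer (assumed here, not shown).
    """
    # Softmax over time turns the scores into weights that sum to 1.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    dim = len(frames[0])
    mean = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
    # Weighted second moment minus squared mean gives the weighted variance.
    var = [
        sum(w * f[d] ** 2 for w, f in zip(weights, frames)) - mean[d] ** 2
        for d in range(dim)
    ]
    std = [math.sqrt(max(v, 1e-12)) for v in var]
    return mean + std


frames = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
uniform = attentive_stats_pooling(frames, [0.0, 0.0, 0.0])   # same as average pooling
focused = attentive_stats_pooling(frames, [0.0, 0.0, 10.0])  # dominated by the last frame
```

With equal scores the result reduces to plain mean and standard deviation over time; a large score on one frame pulls the pooled mean toward that frame, which is the behavior average pooling cannot express.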

Audio Processing Pipeline

Every model shared the same audio frontend:

Parameter	Value
Sample rate	16,000 Hz
Clip duration	3 seconds (48,000 samples)
Mel bins	128
FFT size	1,024
Hop length	160
Frequency range	20 Hz – 7,600 Hz
Amplitude to dB	top_db = 80.0
Normalization	Per-sample mean/std

Audio longer than 3 seconds was center-cropped; shorter clips were zero-padded. Stereo was mixed to mono. Peak normalization (divide by max absolute value) was applied before feature extraction.
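The length and level handling described above can be sketched in a few lines of plain Python. The function names are hypothetical, and two details are assumptions the text does not specify: zero-padding is appended at the end of the clip, and peak normalization runs after cropping.

```python
CLIP_SAMPLES = 48_000  # 3 seconds at 16 kHz, from the table above


def to_mono(channels):
    """Average equal-length channels sample by sample."""
    n = len(channels)
    return [sum(samples) / n for samples in zip(*channels)]


def fix_length(wave, target=CLIP_SAMPLES):
    """Center-crop long clips; zero-pad short ones (end padding is an assumption)."""
    if len(wave) >= target:
        start = (len(wave) - target) // 2
        return wave[start:start + target]
    return wave + [0.0] * (target - len(wave))


def peak_normalize(wave, eps=1e-9):
    """Divide by the maximum absolute value, guarding against silent clips."""
    peak = max(abs(x) for x in wave)
    return [x / max(peak, eps) for x in wave]


def preprocess(channels):
    """Mono mix, fix length, then peak-normalize (this ordering is an assumption)."""
    return peak_normalize(fix_length(to_mono(channels)))


# A 1-second stereo clip is mixed down, padded to 3 seconds, and scaled to peak 1.0.
short_clip = preprocess([[0.5] * 16_000, [0.1] * 16_000])
# A clip slightly longer than 3 seconds loses 1,000 samples from each end.
long_clip = preprocess([[float(i) for i in range(50_000)]])
```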

Training Configuration

Parameter	Value
Optimizer	AdamW
Loss function	BCEWithLogitsLoss
Learning rate	5e-5
Weight decay	0.01
Dropout	0.3
Batch size	128 (mel-only) / 64 (WavLM)
Mixed precision	Yes (CUDA AMP)
Gradient clipping	max_norm = 1.0
TF32	Enabled
Training split	95% train / 5% validation
Hardware	NVIDIA GeForce RTX 5090

Datasets

Dataset	Description	Role
Custom dataset (train)	Samples from multiple deepfake generators + real speech	Primary training set
Custom holdout (polished)	Curated held-out split from the custom dataset	Test set for custom-trained models
ASVspoof5 3s FLAC (eval)	ASVspoof 2024 challenge evaluation set, 3-second clips in FLAC format	Training and evaluation for ASV experiments
Deepfake-Eval-2024	External benchmark, 1,973 files from unseen generators	Out-of-distribution generalization benchmark

Evaluation Metrics

I tracked accuracy, precision, recall, F1, Matthews Correlation Coefficient (MCC), Equal Error Rate (EER), sensitivity, and specificity across all runs. EER is particularly important: it's the error rate at the threshold where the false acceptance rate equals the false rejection rate, and it's the standard metric in the anti-spoofing community.
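For readers who haven't computed EER before, here is a minimal sketch in plain Python. It is not the evaluation code behind the numbers in this post: the function name is hypothetical, and it uses a simple sweep over observed scores rather than interpolating the ROC curve, so it is only an approximation on small sets.

```python
def equal_error_rate(scores, labels):
    """Approximate EER from fake-probability scores and 0/1 labels (1 = fake).

    Tries every observed score as a threshold and returns the error rate
    where false acceptance and false rejection are closest to each other.
    """
    n_fake = sum(labels)
    n_real = len(labels) - n_fake
    best_gap, eer = float("inf"), 1.0
    for threshold in sorted(set(scores)):
        # False rejection: a real clip scored at or above the threshold.
        frr = sum(s >= threshold and y == 0 for s, y in zip(scores, labels)) / n_real
        # False acceptance: a fake clip scored below the threshold.
        far = sum(s < threshold and y == 1 for s, y in zip(scores, labels)) / n_fake
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


# Toy scores: one real clip (0.6) outranks one fake clip (0.4).
scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
eer = equal_error_rate(scores, labels)  # → 0.25
```

Because EER is taken at the crossover point, it does not depend on any fixed decision threshold, which is what makes it comparable across models that are calibrated differently.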

Part 1: The Main Architecture Comparison

I trained all four architectures on the ASV5 balanced dataset. Mel-only models ran for 50 epochs; WavLM models ran for 15 epochs (each WavLM epoch takes roughly 10x longer due to the frozen feature extraction step).

Held-Out Test Set Results (Best Epoch)

Model	Epochs	Test Acc	Test F1	Test Prec	Test Recall	Test MCC	Test EER
WavLM + AASIST	15	99.18%	99.25%	–	–	–	0.0078
WavLM + CNN	15	98.76%	–	–	–	–	0.0124
Mel-AASIST	50	98.55%	98.68%	97.93%	99.43%	0.9710	0.0121
Mel-CNN	50	96.33%	96.65%	95.47%	97.87%	0.9262	0.0342

The ranking made intuitive sense. WavLM + AASIST had the most parameters and the richest features. Mel-CNN had the least. The spread was tight — only 2.85 percentage points separated the best from the worst.

Then I ran Deepfake-Eval-2024.

Deepfake-Eval-2024 Results (Out-of-Distribution)

Model	Acc	Prec	Recall	F1	MCC	EER
Mel-AASIST	83.17%	78.92%	80.67%	0.7978	0.6539	0.1759
WavLM + AASIST	75.67%	84.58%	50.00%	0.6285	0.5004	0.2494
WavLM + CNN	73.54%	65.90%	74.01%	0.6972	0.4662	0.2644
Mel-CNN	72.48%	62.47%	83.00%	0.7129	0.4757	0.2558

The ranking reshuffled. Mel-AASIST, only third on test accuracy, was now best by a wide margin, while the test-set winner, WavLM + AASIST, fell to a distant second. Mel-AASIST beat the nearest WavLM model by 7.5 percentage points on accuracy and achieved an EER more than 7 points lower (0.1759 vs 0.2494).

Architecture comparison showing test accuracy vs real-world DF-Eval accuracy — the rankings completely invert between the two benchmarks

But the precision/recall breakdown reveals something even more interesting about how each model fails:

WavLM + AASIST had the highest precision (84.58%) but the lowest recall (50.00%). It was conservative — when it flagged something as fake, it was usually right. But it missed half of all deepfakes entirely. A coin flip on whether it catches a fake.
Mel-CNN had the opposite problem: low precision (62.47%) but high recall (83.00%). It caught most fakes but generated a lot of false positives.
Mel-AASIST hit the sweet spot: 78.92% precision and 80.67% recall. Balanced performance, which is why its F1 (0.7978) was far ahead of the pack.
WavLM + CNN landed in the middle on both axes (65.90% precision, 74.01% recall).

The MCC scores tell the same story more starkly. Mel-AASIST at 0.6539 indicates strong agreement between predictions and ground truth. WavLM + CNN at 0.4662 is moderate at best.
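The F1 figures in the DF-Eval table can be sanity-checked directly from the precision and recall columns, since F1 is just their harmonic mean. A small sketch (the dictionary name is hypothetical; the numbers are copied from the table):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall, both given as fractions."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) from the DF-Eval table above.
DF_EVAL = {
    "Mel-AASIST": (0.7892, 0.8067, 0.7978),
    "WavLM + AASIST": (0.8458, 0.5000, 0.6285),
    "WavLM + CNN": (0.6590, 0.7401, 0.6972),
    "Mel-CNN": (0.6247, 0.8300, 0.7129),
}

recomputed = {name: f1_score(p, r) for name, (p, r, _) in DF_EVAL.items()}
for name, (_, _, reported) in DF_EVAL.items():
    print(f"{name}: reported {reported:.4f}, recomputed {recomputed[name]:.4f}")
```

The recomputed values agree with the reported ones to within rounding of the inputs. The harmonic mean is why WavLM + AASIST's 50% recall drags its F1 down so hard despite the best precision in the table.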

Precision vs recall scatter plot showing each architecture's tradeoff on DF-Eval-2024 — Mel-AASIST sits in the balanced sweet spot while WavLM+AASIST is precise but misses half of all fakes

Part 2: Training Dynamics

Raw final numbers hide important patterns. Here's how each model evolved during training.

WavLM + CNN Training Curve (15 epochs)

Epoch	Val Acc	Val EER	Test Acc	Test EER
1	87.90%	0.1210	94.38%	0.0537
4	91.37%	0.0864	97.28%	0.0291
10	92.32%	0.0768	98.76%	0.0124
15	91.76%	0.0824	98.63%	0.0137

Notice that val accuracy peaked at epoch 10 (92.32%) and had declined by epoch 15 (91.76%), while test accuracy barely moved. Classic early signs of overfitting. Between epoch 10 and 15 the model gained nothing: test accuracy slipped 0.13 points (98.76% to 98.63%) and validation lost 0.56 points. I should have stopped at epoch 10.

WavLM + AASIST Training Curve (15 epochs)

Epoch	Val Acc	Val EER	Test Acc	Test F1	Test EER
2	89.46%	0.1054	97.40%	97.60%	0.0245
7	93.25%	0.0675	99.13%	99.20%	0.0090
11	93.19%	0.0681	99.23%	99.29%	0.0073
14	93.57%	0.0643	99.18%	99.25%	0.0060
15	93.94%	0.0606	99.18%	99.25%	0.0078

This model trained more stably. Validation EER improved from 0.1054 to 0.0606 over 15 epochs, with only a small bump around epoch 11. The test EER bottomed out at epoch 14 (0.0060) and slightly regressed at epoch 15 (0.0078). The AASIST architecture's attention mechanisms likely helped regularize the classifier built on the frozen WavLM features.
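Which checkpoint counts as "best" depends on which metric does the choosing. A minimal sketch using the rows from the WavLM + AASIST table above (the helper name and the dictionary layout are hypothetical):

```python
# Rows copied from the WavLM + AASIST training curve above.
HISTORY = [
    {"epoch": 2, "val_eer": 0.1054, "test_eer": 0.0245},
    {"epoch": 7, "val_eer": 0.0675, "test_eer": 0.0090},
    {"epoch": 11, "val_eer": 0.0681, "test_eer": 0.0073},
    {"epoch": 14, "val_eer": 0.0643, "test_eer": 0.0060},
    {"epoch": 15, "val_eer": 0.0606, "test_eer": 0.0078},
]


def best_epoch(history, metric):
    """Return the epoch with the lowest value of an error-style metric."""
    return min(history, key=lambda row: row[metric])["epoch"]


by_val = best_epoch(HISTORY, "val_eer")    # → 15
by_test = best_epoch(HISTORY, "test_eer")  # → 14
```

The two criteria disagree by only one epoch here, but the same helper applied to an out-of-distribution metric could pick a very different checkpoint, which is the theme of the sections that follow.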

Mel-AASIST and Mel-CNN (50 epochs)

The mel-only models told a different story. The Mel-AASIST reached best validation loss at epoch 14 (val_loss=0.2895), while Mel-CNN hit best validation at epoch 19 (val_loss=0.3454). But — and this is crucial — their Deepfake-Eval performance continued improving well past the point where training loss had converged and validation loss had plateaued.

For Mel-AASIST:

Train loss at epoch 50: 0.0169 (essentially converged by epoch ~15)
Final val accuracy: 89.38%, val EER: 0.1062
But DF-Eval accuracy at epoch 50: 83.17% — a number I never would have reached if I'd stopped at the "optimal" early stopping point

For Mel-CNN:

Train loss at epoch 50: 0.0992 (higher than AASIST, suggesting the simpler architecture struggled more with the training distribution)
Final val accuracy: 85.91%, val EER: 0.1409
DF-Eval accuracy at epoch 50: 72.48%

The gap between validation EER (0.1062 for AASIST vs 0.1409 for CNN) predicted the gap in generalization performance (83.17% vs 72.48% on DF-Eval). Validation EER was a much better predictor of real-world performance than test accuracy.

Part 3: The 100% Accuracy Trap

In a separate set of experiments, I fine-tuned WavLM + AASIST models on my custom dataset instead of ASV5. WavLM stayed frozen. I ran two variants: one trained on clean data, one on corrupted data.

Clean Data Fine-Tuning

Epoch	Train Loss	Train Acc	Val Acc	Val EER	Test Acc	Test F1	Test MCC
1	0.0340	98.93%	45.64%	0.5436	95.71%	97.66%	0.7542
6	0.0018	99.94%	46.84%	0.5368	99.92%	99.96%	0.9960
10	0.0006	99.98%	49.68%	0.5058	100.00%	100.00%	1.0000
15	0.0003	99.99%	49.76%	0.4998	100.00%	100.00%	1.0000
20	0.0001	100.00%	50.05%	0.4980	100.00%	100.00%	1.0000

Deepfake-Eval (epoch 10): Acc=59.15%, Prec=100.00%, Recall=0.74%, F1=0.0147, MCC=0.0660, EER=0.4870
Deepfake-Eval (epoch 20): Acc=59.25%, Prec=100.00%, Recall=0.99%, F1=0.0195, MCC=0.0763, EER=0.4838

Read those DF-Eval numbers carefully. 100% precision but 0.74% recall. The model almost never predicted "fake" on out-of-distribution data. When it did (about 6 of the ~810 fake samples), it happened to be right. But it missed 99% of all deepfakes. The 59% accuracy comes almost entirely from correctly labeling real speech as real, because the model learned to call everything real.
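To see why roughly 59% accuracy is no achievement here, compare it with a detector that labels every clip as real. The sketch below assumes about 810 fake files out of 1,973, which is an approximation taken from the figures in the text, so the results will be close to the reported numbers but not identical.

```python
N_TOTAL = 1973  # Deepfake-Eval-2024 audio files, from the text
N_FAKE = 810    # approximate count of fake files (assumption)


def accuracy_from_rates(recall, specificity, n_fake=N_FAKE, n_total=N_TOTAL):
    """Accuracy implied by recall on fakes and specificity on real clips."""
    n_real = n_total - n_fake
    return (recall * n_fake + specificity * n_real) / n_total


# A detector that never says "fake": zero recall, perfect specificity.
always_real = accuracy_from_rates(recall=0.0, specificity=1.0)

# The epoch-10 model: 0.74% recall, and 100% precision implies no false alarms.
epoch_10 = accuracy_from_rates(recall=0.0074, specificity=1.0)

print(f"always-real baseline: {always_real:.2%}")
print(f"epoch-10 model:       {epoch_10:.2%}")
```

Under these assumptions the fine-tuned model sits within a fraction of a percentage point of the do-nothing baseline, which is why accuracy alone hides the failure and recall exposes it.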

The validation set was screaming at me. Val EER hovered around 0.50 from epoch 1 to epoch 20, literally random chance, while test accuracy climbed to 100%. This is the most extreme test/val divergence I've ever seen. The model didn't learn a single generalizable feature. It memorized the training distribution, and the test set, drawn from that same distribution, rewarded it for doing so.

Dual-axis chart showing test accuracy climbing to 100% while validation EER stays pinned at 0.50 (random chance) — the most extreme train/val divergence in the experiments

Corrupt Data Fine-Tuning

Epoch	Train Loss	Train Acc	Val Acc	Val EER	Test Acc	Test F1	Test MCC	Test EER
1	0.0878	96.78%	43.42%	0.5658	93.69%	96.59%	0.6158	0.0103
7	0.0145	99.48%	51.05%	0.4876	98.65%	99.25%	0.9273	0.0041
14	0.0056	99.80%	51.96%	0.4849	98.99%	99.44%	0.9460	0.0027
20	0.0044	99.85%	50.80%	0.4960	99.23%	99.57%	0.9591	0.0011

Deepfake-Eval (epoch 10): Acc=59.15%, Prec=65.00%, Recall=1.60%, F1=0.0312, MCC=0.0490, EER=0.4931
Deepfake-Eval (epoch 20): Acc=59.00%, Prec=63.64%, Recall=0.86%, F1=0.0170, MCC=0.0342, EER=0.4850

The corrupt variant was slightly less extreme on the test set: it didn't hit 100% test accuracy, settling at 99.23%. But the DF-Eval story was identical: ~59% accuracy, recall under 2%, EER around 0.49. Training on corrupted audio added no regularization benefit whatsoever.

Both models were confidently, perfectly wrong. They solved the training set and learned nothing about deepfakes.

Part 4: The 20-Epoch vs 50-Epoch Story

One of my most actionable findings came from comparing the same architecture at different training lengths.

I ran Mel-AASIST for 20 epochs on clean ASV5 FLAC in the ablation study, and separately for 50 epochs on the ASV5 balanced dataset.

Mel-AASIST at 20 Epochs (ASV5 FLAC Clean)

Epoch	Train Loss	Train Acc	Val Acc	Val EER	Test Acc	Test F1	Test EER
1	0.1086	96.20%	67.60%	0.3238	97.19%	98.45%	0.0382
8	0.0041	99.88%	66.13%	0.3386	98.21%	99.01%	0.0010
12	0.0028	99.91%	66.59%	0.3340	98.74%	99.30%	0.0010
15	0.0014	99.96%	66.22%	0.3377	98.53%	99.18%	0.0010
20	0.0006	99.98%	64.73%	0.3526	98.05%	98.92%	0.0041

DF-Eval (epoch 10): Acc=59.96%, Prec=54.58%, Recall=16.13%, F1=0.2490, MCC=0.1015, EER=0.4763
DF-Eval (epoch 20): Acc=58.59%, Prec=47.97%, Recall=7.27%, F1=0.1262, MCC=0.0357, EER=0.5161

At 20 epochs: 58.59% DF-Eval accuracy, 0.5161 EER. Essentially random.

Mel-AASIST at 50 Epochs (ASV5 Balanced)

At 50 epochs: 83.17% DF-Eval accuracy, 0.1759 EER.

That's a 24.6 percentage point jump in DF-Eval accuracy and a 0.34 drop in EER from the longer run. The two runs also used different ASV5 variants (clean FLAC vs balanced), so this isn't a perfectly controlled comparison. Still, the training loss was already near zero by epoch 15 in both runs. By conventional early stopping logic, there was no reason to keep training. But the model was still learning something, some slow-forming, generalizable representation of what makes audio fake, that only showed up on out-of-distribution data.

Mel-AASIST DF-Eval metrics at 20 vs 50 epochs — a 24.6 percentage point jump in accuracy just from training longer past the point where loss had converged

This has significant practical implications. If you're training a deepfake detector and evaluating only on in-distribution metrics, you'll stop too early. The features that generalize take longer to learn than the features that memorize.

Two additional observations from the 20-epoch run:

Validation EER was around 0.33 throughout — much better than the ~0.50 seen in the WavLM fine-tuning experiments. The mel-only model was at least learning some transferable features, even at 20 epochs.
Test accuracy actually degraded slightly from epoch 12 (98.74%) to epoch 20 (98.05%), suggesting the model was beginning to overfit to training data, yet DF-Eval could have improved with more training. In-distribution and out-of-distribution performance can move in opposite directions.

Part 5: The Codec and Data Ablation

I ran a systematic ablation across audio formats, data corruption, and training data source. Six CNN variants were trained and evaluated on Deepfake-Eval-2024 at a decision threshold of 0.5:

Variant	Dataset	Format	Corrupt	Acc	Prec	Recall	F1	MCC	Spec
Clean Custom CNN	Custom	Mixed	No	77.33%	87.18%	52.47%	0.6551	0.5372	94.63%
Corrupt Custom CNN	Custom	Mixed	Yes	75.93%	89.64%	46.76%	0.6146	0.5157	96.24%
FLAC CNN	ASV5	FLAC	No	61.56%	79.71%	8.49%	0.1534	0.1681	98.50%
FLAC Corrupt CNN	ASV5	FLAC	Yes	53.39%	32.81%	12.96%	0.1858	-0.0736	81.53%
MP3 CNN	ASV5	MP3	No	59.78%	65.12%	4.32%	0.0810	0.0819	98.39%
MP3 Corrupt CNN	ASV5	MP3	Yes	41.04%	41.04%	100.00%	0.5819	0.0000	0.00%

Data source and codec ablation — custom dataset models dominate ASV5 regardless of format, and corruption never helps

There are several stories in this table:

1. Data source dominates everything. Custom dataset models (77.33%, 75.93%) crushed ASV5 models (41-62%) regardless of codec or corruption. The custom dataset included samples from more deepfake generators, and that diversity was the single largest lever for generalization. Between the clean models, that's a 16-18 percentage point gap from data diversity alone; between the corrupt models the gap widens to 23-35 points.

2. Codec matters, but less than data source. Within ASV5, FLAC (61.56%) slightly outperformed MP3 (59.78%) on clean data. FLAC preserves more spectral detail that the model can use to distinguish real from fake. MP3 compression smears the very artifacts the detector needs to find.

3. Corruption is neutral to harmful. Clean vs corrupt performance within the same data source:

Custom: 77.33% clean vs 75.93% corrupt (-1.4 points)
ASV5 FLAC: 61.56% clean vs 53.39% corrupt (-8.2 points)
ASV5 MP3: 59.78% clean vs 41.04% corrupt (-18.7 points)

Corruption never helped. On ASV5 MP3, it was catastrophic — the corrupt model collapsed to 41% accuracy with 0.00% specificity, meaning it classified every single sample as fake.

4. Failure modes differ dramatically. The clean FLAC and MP3 CNN models were extremely conservative (very high specificity of 98%+, very low recall of 4-9%). They almost never flagged anything as fake. The FLAC corrupt model also had low recall (12.96%) but raised far more false alarms (81.53% specificity). The MP3 corrupt model went the other direction entirely: 100% recall, 0% specificity, labeling everything as fake. The custom dataset models landed in a healthier middle ground.

5. The MCC reveals the true picture. The FLAC corrupt CNN had a negative MCC (-0.0736), meaning its predictions were slightly anti-correlated with ground truth: its precision (32.81%) was below the roughly 41% share of fakes in the benchmark, so a clip it flagged was less likely to be fake than one picked at random. The MP3 corrupt CNN had an MCC of exactly 0.0000, the value assigned to any constant predictor, which is what its predict-everything-as-fake strategy amounts to.
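A short sketch makes the MCC behavior for the FLAC corrupt and MP3 corrupt rows concrete. The confusion counts below are not from the experiment logs: they are back-derived from the reported recall and specificity, assuming roughly 810 fake and 1,163 real files, so treat them as approximate.

```python
import math


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # A constant predictor zeroes the denominator; MCC is conventionally 0 then.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)


# Approximate counts implied by 12.96% recall and 81.53% specificity (assumption).
flac_corrupt = dict(tp=105, fp=215, tn=948, fn=705)
# Flipping every prediction swaps tp with fn and tn with fp.
flipped = dict(tp=705, fp=948, tn=215, fn=105)
# The MP3 corrupt model: every clip labeled fake.
all_fake = dict(tp=810, fp=1163, tn=0, fn=0)

print(f"FLAC corrupt: MCC {mcc(**flac_corrupt):+.4f}, acc {accuracy(**flac_corrupt):.2%}")
print(f"flipped:      MCC {mcc(**flipped):+.4f}, acc {accuracy(**flipped):.2%}")
print(f"all fake:     MCC {mcc(**all_fake):+.4f}, acc {accuracy(**all_fake):.2%}")
```

Flipping the labels turns the MCC positive but lowers accuracy, because real clips are the majority class. That is a useful reminder that MCC and accuracy can point in different directions on an imbalanced benchmark.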

I also ran AASIST variants in the same ablation across ASV5 FLAC and MP3, plus a standalone ablation of Mel-AASIST on the custom dataset. The numbered result directories (01-12) capture the full matrix:

#	Config	Arch	Data	Corrupt
01	Custom AASIST + WavLM	AASIST	Custom	Clean
02	Custom AASIST + WavLM	AASIST	Custom	Corrupt
03	Custom CNN + WavLM	CNN	Custom	Clean
04	Custom CNN + WavLM	CNN	Custom	Corrupt
05	ASV FLAC AASIST Mel	AASIST	ASV5 FLAC	Clean
06	ASV FLAC AASIST Mel	AASIST	ASV5 FLAC	Corrupt
07	ASV FLAC CNN Mel	CNN	ASV5 FLAC	Clean
08	ASV FLAC CNN Mel	CNN	ASV5 FLAC	Corrupt
09	ASV MP3 AASIST Mel	AASIST	ASV5 MP3	Clean
10	ASV MP3 AASIST Mel	AASIST	ASV5 MP3	Corrupt
11	ASV MP3 CNN Mel	CNN	ASV5 MP3	Clean
12	ASV MP3 CNN Mel	CNN	ASV5 MP3	Corrupt

Plus cross-domain transfer experiments fine-tuning ASV5-pretrained WavLM+AASIST models on the custom holdout set (clean and corrupt variants, three runs each with different timestamps).

Part 6: Earlier Experiments and the Polished vs Unpolished Gap

Before the systematic ablation, I ran earlier WavLM + CNN experiments that revealed another important finding: the gap between "polished" and "unpolished" evaluation data.

These earlier runs used two validation sets simultaneously — an unpolished holdout (raw recordings) and a polished holdout (cleaned/curated recordings):

Run	Epochs	Val Acc (Unpol.)	Val EER (Unpol.)	Val Acc (Pol.)	Val EER (Pol.)	Test Acc
Resumed ep. 17	18	93.22%	0.0677	86.32%	0.1377	97.77%
Resumed ep. 17	19	92.48%	0.0752	86.73%	0.1347	–
Fresh start	1	91.93%	0.0807	85.16%	0.1537	92.81%
Large dataset	1	84.29%	0.1570	81.04%	0.1498	98.46%

In three of the four runs the polished data was clearly harder: a 5-7 percentage point accuracy gap and roughly double the EER. The large dataset run was the exception, with a 3.25 point accuracy gap and nearly equal EER (0.1570 vs 0.1498). This is important because real-world deepfakes are increasingly polished. A detector that works on raw generator output but fails on post-processed audio has limited practical value.

The large dataset run is also notable: despite only reaching 84.29% validation accuracy after 1 epoch, it hit 98.46% test accuracy. This suggests the model was rapidly fitting a distribution that the training and test sets share while struggling with the validation set's diversity, another early signal of the overfitting pattern I'd see more dramatically in later experiments.

Part 7: Why Did the Small Model Win?

I don't have a definitive answer, but the data points toward a hypothesis.

The WavLM representation space is too specific. WavLM was pretrained to understand speech — phonemes, prosody, speaker identity, linguistic content. When you freeze those representations and train a binary classifier on top, the classifier learns to detect artifacts in the space of speech representations. Those artifacts turn out to be specific to whatever deepfake generators exist in the training data. When the generator changes, the artifacts in WavLM-space change too, and the classifier breaks.

The evidence for this: the WavLM models achieved the highest test accuracy (99.18%) yet dropped 23.5 to 25.2 percentage points on DF-Eval, against 15.4 for Mel-AASIST (Mel-CNN, at 23.9, fell about as far as the WavLM models). The frozen features are powerful enough to almost perfectly separate training-distribution real from fake, but that separation doesn't generalize.

Mel-spectrograms force more universal feature learning. Mel-AASIST operates directly on mel-spectrograms. It has to learn its own features from scratch. With multi-scale convolutions (3x3, 5x5, and dilated 7x7 kernels), it's forced to find patterns at multiple time-frequency resolutions simultaneously. Squeeze-and-excitation attention lets it learn which frequency bands matter most. Attentive statistics pooling lets it learn which time frames are most informative.

These patterns (subtle spectral inconsistencies, unnatural harmonic structures, overly smooth formant transitions, and whatever trace phase errors leave in the magnitude spectrum, since mel-spectrograms discard phase itself) appear to be more universal across different deepfake generators than whatever the WavLM classifier was latching onto.

The parameter count constraint may be a feature, not a bug. With only ~2M parameters (vs 350M+), Mel-AASIST can't memorize as easily. It's forced into compression — finding compact, generalizable representations rather than storing specific examples. This is the classic bias-variance tradeoff manifesting in a particularly dramatic way.

The recall pattern supports this. WavLM + AASIST achieved 84.58% precision but only 50% recall on DF-Eval. It was confident when it predicted "fake" but missed half of all fakes. This suggests it learned a narrow set of generator-specific tells — when it saw those tells, it was right, but most out-of-distribution fakes don't have those specific tells. Mel-AASIST achieved 78.92% precision and 80.67% recall — a more balanced detector that recognizes a wider variety of fakeness signals.

The Complete Picture

Putting all results together in one table, ranked by Deepfake-Eval accuracy:

Rank	Model	Data	Ep	Test Acc	DF-Eval Acc	DF-Eval F1	DF-Eval EER
1	Mel-AASIST	ASV5 bal.	50	98.55%	83.17%	0.7978	0.1759
2	Clean Custom CNN	Custom	3	–	77.33%	0.6551	–
3	Corrupt Custom CNN	Custom	10	–	75.93%	0.6146	–
4	WavLM + AASIST	ASV5	15	99.18%	75.67%	0.6285	0.2494
5	WavLM + CNN	ASV5	15	98.76%	73.54%	0.6972	0.2644
6	Mel-CNN	ASV5 bal.	50	96.33%	72.48%	0.7129	0.2558
7	FLAC CNN	ASV5 FLAC	3	–	61.56%	0.1534	–
8	MP3 CNN	ASV5 MP3	1	–	59.78%	0.0810	–
9	WavLM+AASIST (FT clean)	Custom	20	100.00%	59.25%	0.0195	0.4838
10	WavLM+AASIST (FT corr.)	Custom	20	99.23%	59.00%	0.0170	0.4850
11	Mel-AASIST	ASV5 FLAC	20	98.05%	58.59%	0.1262	0.5161
12	FLAC Corrupt CNN	ASV5 FLAC	2	–	53.39%	0.1858	–
13	MP3 Corrupt CNN	ASV5 MP3	9	–	41.04%	0.5819	–

All 13 models ranked by real-world DF-Eval accuracy — test accuracy has essentially zero correlation with real-world performance

The correlation between test accuracy and DF-Eval accuracy is essentially zero. The model with the highest test accuracy (100.00%) had one of the worst DF-Eval scores (59.25%). The model with the lowest test accuracy in the main comparison (96.33%) still beat both fine-tuned WavLM models on DF-Eval by more than 13 points.
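The correlation claim can be checked directly against the seven rows of the table that report both a test accuracy and a DF-Eval accuracy. The sketch below computes a plain Pearson coefficient; the function and variable names are hypothetical, and with only seven points the result should be read as a rough indication rather than a precise estimate.

```python
import math

# (test accuracy %, DF-Eval accuracy %) for the runs that report both.
RUNS = [
    (98.55, 83.17),   # Mel-AASIST, 50 epochs
    (99.18, 75.67),   # WavLM + AASIST
    (98.76, 73.54),   # WavLM + CNN
    (96.33, 72.48),   # Mel-CNN
    (98.05, 58.59),   # Mel-AASIST, 20 epochs
    (100.00, 59.25),  # WavLM+AASIST fine-tuned, clean
    (99.23, 59.00),   # WavLM+AASIST fine-tuned, corrupt
]


def pearson(pairs):
    """Pearson correlation coefficient for a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)


r = pearson(RUNS)
print(f"Pearson r between test and DF-Eval accuracy: {r:+.2f}")
```

The coefficient comes out weakly negative. On seven data points that is not distinguishable from zero, but it certainly offers no support for using test accuracy as a proxy for real-world performance.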

Lessons Learned

1. Test set accuracy is nearly meaningless for deepfake detection. Every model with a held-out test score landed above 96%. On real-world data those same models ranged from 83.17% down to 58.59%, a gap of over 24 percentage points. The correlation between the two was weak and, if anything, negative. If you're evaluating a deepfake detector and only reporting in-distribution accuracy, you're not measuring anything useful.

2. Bigger models aren't better — they're better at memorizing. WavLM-Large has 175x more parameters than Mel-AASIST. Those extra parameters bought better memorization of the training distribution and worse generalization to everything else. The frozen SSL features, which are supposed to provide "general" audio understanding, actually created a representation space where overfitting was easier, not harder.

3. Perfect scores are a red flag, not a celebration. When your model hits 100% accuracy, your first reaction should be suspicion, not satisfaction. In my case, 100% test accuracy coexisted with coin-flip performance on real-world data. The model hadn't solved deepfake detection. It had solved the training set.

4. Data diversity beats data quantity, model size, and architecture. The single biggest lever for Deepfake-Eval performance wasn't architecture or parameter count. It was training on data from diverse deepfake generators. A 400K parameter CNN trained on my custom dataset (77.33%) outperformed a 350M+ parameter WavLM model trained on ASV5 (73.54%). Within the CNN ablation, the custom dataset models outperformed their ASV5 counterparts by 16 percentage points or more.

5. Train longer than you think you need to. Mel-AASIST went from 58.59% DF-Eval accuracy at 20 epochs to 83.17% at 50 epochs. Training loss had converged by epoch 15. The generalizable features took 3x longer to form than the memorization features. Conventional early stopping based on validation loss would have killed the run before the good features emerged.

6. Audio codec choice matters. FLAC-trained models consistently outperformed MP3-trained models. MP3 compression destroys the subtle spectral artifacts that differentiate real from fake audio. If you're building a deepfake detector, train on lossless audio.

7. Data corruption doesn't help. Clean vs corrupt training showed negligible differences at best and catastrophic degradation at worst (MP3 corrupt: 41% accuracy). The bottleneck is exposure to diverse generation methods, not robustness to noise. Don't waste cycles on audio corruption augmentation for this task.

8. Watch your validation set, not your test set. In the WavLM fine-tuning experiments, validation accuracy sat at ~50% while test accuracy climbed to 100%. That 50-point divergence was the model screaming that it was overfitting. In the mel-only experiments, validation EER (0.11 for AASIST vs 0.14 for CNN) correctly predicted which model would generalize better. The validation set is your early warning system. Trust it over the test set.

9. Precision/recall balance predicts real-world utility. High precision with low recall (WavLM + AASIST: 84.58% / 50.00%) means the model is conservative but misses half the fakes. High recall with low precision (Mel-CNN: 62.47% / 83.00%) means it catches fakes but generates false alarms. Balanced precision and recall (Mel-AASIST: 78.92% / 80.67%) is what you want for a production system. Accuracy alone doesn't tell you this.

10. The polished data gap is real. Three of the four earlier runs showed a 5-7 percentage point accuracy drop when evaluating on polished (post-processed) audio vs raw audio. Real-world deepfakes are increasingly polished. Evaluation on raw generator output overestimates real-world performance.

Model Size vs Performance

Architecture	Params	Size	DF-Eval Acc	DF-Eval EER	Acc/MB
Mel-CNN	~400K	435 KB	72.48%	0.2558	166.6%/MB
Mel-AASIST	~2M	2.2 MB	83.17%	0.1759	37.8%/MB
WavLM + CNN	~350M	~1.3 GB	73.54%	0.2644	0.06%/MB
WavLM + AASIST	~350M	~1.3 GB	75.67%	0.2494	0.06%/MB

Model size vs real-world performance — Mel-AASIST at 2M parameters outperforms WavLM models 175x its size

Mel-AASIST delivers the best absolute performance at 2.2 MB. The WavLM models are 590x larger for worse results. Mel-CNN is the efficiency champion on a per-megabyte basis but trails on absolute performance.
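The Acc/MB column and the size ratio can be reproduced with a few lines of arithmetic. The sketch assumes decimal units (435 KB = 0.435 MB, 1.3 GB = 1,300 MB), which appears to be what the table uses; the dictionary name is hypothetical.

```python
# (checkpoint size in MB, DF-Eval accuracy in %) from the table above.
MODELS = {
    "Mel-CNN": (0.435, 72.48),
    "Mel-AASIST": (2.2, 83.17),
    "WavLM + CNN": (1300.0, 73.54),
    "WavLM + AASIST": (1300.0, 75.67),
}

acc_per_mb = {name: acc / size for name, (size, acc) in MODELS.items()}
size_ratio = MODELS["WavLM + AASIST"][0] / MODELS["Mel-AASIST"][0]

for name, value in acc_per_mb.items():
    print(f"{name}: {value:.2f} %/MB")
print(f"WavLM checkpoint is about {size_ratio:.0f}x the size of Mel-AASIST")
```

Accuracy per megabyte rewards small models almost by construction, so it is best read alongside absolute accuracy rather than as a ranking on its own.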

For production deployment where model size, latency, and compute cost matter, Mel-AASIST is the clear winner: the best accuracy, F1, MCC, and EER of the four, at a size that fits almost anywhere.

The uncomfortable truth about audio deepfake detection is that the field has a generalization problem. Models that look incredible on benchmarks fall apart when the generator changes. My results suggest that the path forward isn't bigger pretrained models — it's architectures and training strategies that force the model to learn universal artifacts rather than generator-specific fingerprints.

Sometimes less really is more.