TL;DR: SONAR combats spectral bias by explicitly aligning content and high-frequency residuals, yielding better generalization and faster convergence for audio deepfake detection.
Abstract: Deepfake audio detectors often fail to generalize to unseen attacks, in part due to \emph{spectral bias}: neural networks prioritize low-frequency (LF) structure while under-exploiting subtle high-frequency (HF) artifacts left by generative models. We introduce \textbf{SONAR} (Spectral-cONtrastive Audio Residuals), a frequency-guided framework that \emph{explicitly enforces representation-level consistency} between semantic content and HF residuals. Unlike prior frequency-aware or dual-stream detectors that treat HF cues as auxiliary features, SONAR encourages structured interaction between content and noise representations in latent space. The model employs a dual-path architecture in which an XLSR encoder captures LF content, while a parallel branch with learnable, value-constrained 1D SRM (Spatial Rich Model) high-pass filters distills HF residuals. The two representations are fused via frequency cross-attention and trained with a \emph{Jensen--Shannon alignment loss} that promotes LF--HF consistency for genuine audio and amplifies inconsistency for deepfakes. Evaluated on ASVspoof~2021 and in-the-wild benchmarks, SONAR achieves state-of-the-art performance in a \textbf{single-run} setting and converges faster than strong baselines. By mitigating the effects of spectral bias through frequency-guided alignment, SONAR provides a fully data-driven and architecture-agnostic approach to generalizable audio deepfake detection.
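Illustrative sketch (not the authors' code): the two ingredients named in the abstract, a value-constrained high-pass kernel and a Jensen--Shannon alignment objective, can be approximated in a few lines of NumPy. Everything below is an assumption made for illustration: the function names, the clip-then-center constraint, the softmax over embeddings, and the hinge margin for deepfakes are hypothetical choices, and the actual formulation should be taken from the paper and the linked repository.

```python
import numpy as np


def constrained_highpass_kernel(raw, bound=2.0):
    """Hypothetical constraint: clip taps to [-bound, bound], then remove the mean.

    A zero-sum kernel has zero gain at DC, so a constant signal produces no
    output. Centering happens after clipping, so taps may slightly exceed
    `bound` when the clipped kernel has a non-zero mean.
    """
    k = np.clip(np.asarray(raw, dtype=float), -bound, bound)
    return k - k.mean()


def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = np.asarray(x, dtype=float)
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (natural log) over the last axis."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p = p / p.sum(axis=-1, keepdims=True)
    q = q / q.sum(axis=-1, keepdims=True)
    m = 0.5 * (p + q)
    kl_pm = np.sum(p * np.log(p / m), axis=-1)
    kl_qm = np.sum(q * np.log(q / m), axis=-1)
    return 0.5 * (kl_pm + kl_qm)


def alignment_loss(content_emb, residual_emb, is_genuine, margin=0.5):
    """Assumed form of the objective: pull the two streams together for
    genuine clips, push them at least `margin` apart for deepfakes."""
    jsd = js_divergence(softmax(content_emb), softmax(residual_emb))
    genuine = np.asarray(is_genuine, dtype=bool)
    per_clip = np.where(genuine, jsd, np.maximum(0.0, margin - jsd))
    return per_clip.mean()


if __name__ == "__main__":
    kernel = constrained_highpass_kernel([-1.0, 2.0, -1.0])
    rng = np.random.default_rng(0)
    content = rng.normal(size=(4, 16))   # stand-in for content-branch embeddings
    residual = rng.normal(size=(4, 16))  # stand-in for HF-residual embeddings
    labels = [True, True, False, False]  # True = genuine, False = deepfake
    print("kernel:", kernel)
    print("loss:", alignment_loss(content, residual, labels))
```

The margin is kept below ln 2 because the Jensen--Shannon divergence with natural logarithms cannot exceed that value; a larger margin could never be satisfied by any deepfake clip.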
Lay Summary: AI-generated voices are now convincing enough to fool people: voice-cloning scams cost millions in 2024, and synthetic audio fuels political disinformation. Existing detectors struggle to keep up, because systems trained on today's fakes often fail on tomorrow's.
Part of the reason is a known quirk of neural networks: they latch onto the broad, low-pitched structure of speech (what is being said) and overlook the subtle high-pitched details, which is exactly where AI generators tend to leave their fingerprints. Our system, SONAR, splits each clip into two parallel streams: one captures the spoken content, the other isolates only the high-frequency residuals. We then train the model with a simple rule: for real speech, the two streams should agree; for fakes, they should disagree. This forces the network to learn the fingerprints instead of ignoring them.
SONAR sets a new state of the art on standard benchmarks and on harder "in-the-wild" recordings while training faster than comparable detectors, and it stays reliable when audio is compressed by common codecs. By turning a blind spot into the main clue, SONAR gives platforms, banks, and journalists a more dependable tool against voice-cloning fraud.
Link To Code: https://github.com/idonithid/SONAR-Audio-DF-Detection
Primary Area: Applications->Language, Speech and Dialog
Keywords: Speech, DeepFake Detection, Deep Learning, Inductive Bias
Originally Submitted PDF: pdf
Submission Number: 4583