SONAR: Spectral‑Contrastive Audio Residuals for Generalizable Deepfake Detection

Ido Nitzan Hidekel; Gal Lifshitz; Khen Cohen; Dan Raviv

Published: 30 Apr 2026, Last Modified: 24 Jun 2026

Venue: ICML 2026 regular

License: CC BY 4.0

TL;DR: SONAR combats spectral bias by explicitly aligning content and high-frequency residuals, yielding better generalization and faster convergence for audio deepfake detection.

Abstract: Deepfake audio detectors often fail to generalize to unseen attacks, in part due to spectral bias: neural networks prioritize low-frequency structure while under-exploiting subtle high-frequency (HF) artifacts left by generative models. We introduce SONAR (Spectral-cONtrastive Audio Residuals), a frequency-guided framework that explicitly enforces representation-level consistency between semantic content and HF residuals. Unlike prior frequency-aware or dual-stream detectors that treat HF cues as auxiliary features, SONAR encourages structured interaction between content and noise representations in latent space. The model employs a dual-path architecture in which an XLSR encoder captures low-frequency (LF) content, while a parallel branch with learnable, value-constrained 1D SRM (Spatial Rich Model) high-pass filters distills HF residuals. The two representations are fused via frequency cross-attention and trained with a Jensen-Shannon alignment loss that promotes LF–HF consistency for genuine audio and amplifies inconsistency for deepfakes. Evaluated on ASVspoof 2021 and in-the-wild benchmarks, SONAR achieves state-of-the-art performance in a single-run setting and converges faster than strong baselines. By mitigating the effects of spectral bias through frequency-guided alignment, SONAR provides a fully data-driven and architecture-agnostic approach to generalizable audio deepfake detection.
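The high-pass residual branch described in the abstract can be illustrated with a small, dependency-free sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the second-difference kernel `[-1, 2, -1]`, the clamp range of 3.0 standing in for the "value constraint", and the helper names `constrain`, `high_pass`, and `energy`. It only shows why such a filter suppresses slowly varying content while passing fast oscillations.

```python
import math

def constrain(kernel, limit=3.0):
    """Clamp every filter tap to [-limit, limit] (hypothetical value constraint)."""
    return [max(-limit, min(limit, w)) for w in kernel]

def high_pass(signal, kernel):
    """'Valid' 1D correlation of a signal with a short kernel."""
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

def energy(x):
    """Sum of squared samples."""
    return sum(v * v for v in x)

# Assumed kernel: a second difference, whose taps sum to zero,
# so a constant (zero-frequency) input is removed entirely.
kernel = constrain([-1.0, 2.0, -1.0])

# Toy signals: a strong slow component and a weak fast component.
slow = [math.sin(2 * math.pi * 0.01 * n) for n in range(200)]
fast = [0.1 * math.sin(2 * math.pi * 0.45 * n) for n in range(200)]

slow_residual = high_pass(slow, kernel)
fast_residual = high_pass(fast, kernel)

# Before filtering the slow component dominates; after filtering the
# fast component dominates, even though it started ten times smaller.
print(energy(slow) > energy(fast))
print(energy(fast_residual) > energy(slow_residual))
```

In a trainable model the taps would be parameters updated by gradient descent and re-constrained after each step; the fixed kernel above is only a stand-in for that.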

Lay Summary: AI-generated voices are now convincing enough to fool people: voice-cloning scams cost millions in 2024, and synthetic audio fuels political disinformation. Existing detectors struggle to keep up, because systems trained on today's fakes often fail on tomorrow's. Part of the reason is a known quirk of neural networks: they latch onto the broad, low-pitched structure of speech (what is being said) and overlook the subtle high-pitched details, which is exactly where AI generators tend to leave their fingerprints. Our system, SONAR, splits each clip into two parallel streams: one captures the spoken content, the other isolates only the high-frequency residuals. We then train the model with a simple rule: for real speech, the two streams should agree; for fakes, they should disagree. This forces the network to learn the fingerprints instead of ignoring them. SONAR sets a new state of the art on standard benchmarks and on harder "in-the-wild" recordings while training faster than comparable detectors, and it stays reliable when audio is compressed by common codecs. By turning a blind spot into the main clue, SONAR gives platforms, banks, and journalists a more dependable tool against voice-cloning fraud.
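The "agree for real speech, disagree for fakes" rule can be sketched with the Jensen-Shannon divergence that the abstract names. This is a minimal illustration under stated assumptions, not the authors' loss: the function names, the use of a softmax to turn each stream's output into a distribution, and the `log(2) - JS` form for fake samples are all hypothetical choices made here. The only fact relied on is that the Jensen-Shannon divergence with natural logarithms is symmetric and bounded between 0 and `log(2)`.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their midpoint."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def alignment_loss(content_logits, residual_logits, is_real):
    """Hypothetical alignment objective (an assumption, not the paper's formula).

    Real audio: penalize divergence, so the two streams are pulled together.
    Fake audio: penalize the gap to the upper bound log(2), so they are pushed apart.
    """
    js = js_divergence(softmax(content_logits), softmax(residual_logits))
    return js if is_real else math.log(2) - js

# Toy stream outputs: one pair that agrees, one pair that disagrees.
agree = ([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])
disagree = ([4.0, 0.0, -4.0], [-4.0, 0.0, 4.0])

print(alignment_loss(*agree, is_real=True))      # low loss: real and consistent
print(alignment_loss(*disagree, is_real=True))   # higher loss: real but inconsistent
print(alignment_loss(*disagree, is_real=False))  # lower loss than the agreeing fake
print(alignment_loss(*agree, is_real=False))     # highest fake loss: log(2)
```

Using the bounded quantity `log(2) - JS` for fakes keeps the loss non-negative; other ways of rewarding inconsistency (for example a margin) would serve the same illustrative purpose.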

Link To Code: https://github.com/idonithid/SONAR-Audio-DF-Detection

Primary Area: Applications->Language, Speech and Dialog

Keywords: Speech, DeepFake Detection, Deep Learning, Inductive Bias

Submission Number: 4583
