Fooling the Forgers: A Multi-Stage Framework for Audio Deepfake Detection

Published: 2025 · Last Modified: 18 Mar 2026 · ICASSP 2025 · CC BY-SA 4.0
Abstract: Audio deepfakes pose a societal risk because they can erode trust in any recorded audio. In this paper, we present a novel approach to audio deepfake detection that combines Generative Adversarial Networks (GANs) and contrastive learning in a multi-stage detection framework. We first apply pre-trained models (PTMs) to extract phonetic, speaker-identity, and prosodic features, which are crucial for detection. We then enhance the model's performance with a GAN-based data augmentation strategy built on HiFi-GAN. Finally, contrastive learning improves the model's ability to discriminate real speech from fake speech. Our experiments demonstrate that this method outperforms existing approaches in both detection accuracy and robustness.
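The abstract's contrastive-learning stage can be illustrated with a minimal sketch. The function below is a generic margin-based pairwise contrastive loss over real/fake labels, not the paper's actual objective; the embeddings stand in for features a PTM backbone would produce, and the `margin` value is an assumption for illustration.

```python
import numpy as np

def pairwise_contrastive_loss(embeddings, labels, margin=1.0):
    """Average margin-based contrastive loss over all pairs.

    embeddings: (N, D) feature vectors (e.g., from a PTM backbone)
    labels:     length-N sequence, 0 = real speech, 1 = fake speech
    """
    n = len(labels)
    loss, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            d = np.linalg.norm(embeddings[i] - embeddings[j])
            if labels[i] == labels[j]:
                loss += d ** 2                      # pull same-class pairs together
            else:
                loss += max(0.0, margin - d) ** 2   # push cross-class pairs apart
            pairs += 1
    return loss / pairs
```

In this formulation, well-separated real and fake clusters yield a near-zero loss, while overlapping clusters are penalized, which is the discrimination behavior the abstract attributes to the contrastive stage.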