Track: Track 3: AI Security, Privacy, and Adversarial Defenses
Keywords: audio deepfake detection, self-supervised learning, deep embeddings, acoustic features
TL;DR: We compare handcrafted acoustic features and SSL embeddings for audio deepfake detection on the FoR dataset, finding no significant overall superiority of SSL but highlighting Whisper and MFCC as top performers.
Abstract: This paper investigates the relative effectiveness of traditional acoustic features and self-supervised learning (SSL) embeddings for audio deepfake detection on the Fake-or-Real (FoR) corpus. We evaluate nine feature sets (MFCC, CQCC, RMS, ZCR, Teager, Wav2Vec2, HuBERT, WavLM, and Whisper), using Accuracy, F1-score, and Equal Error Rate (EER) as metrics. At the family level, handcrafted and SSL features exhibit very similar distributions; quartiles and Mann--Whitney U tests reveal no statistically significant differences in Accuracy, F1, or EER (\(p > 0.90\)), indicating that SSL embeddings do not globally outperform handcrafted representations.
At the feature level, however, clear patterns emerge: Whisper and MFCC are the most reliable features, with median accuracies of 0.9602 and 0.9571 and median EERs of 0.0375 and 0.0391, respectively, with no significant difference between them. WavLM forms a second tier with competitive but weaker results. Overall, the results show that cepstral features such as MFCC and CQCC remain robust options and that SSL embeddings, particularly Whisper and WavLM, complement rather than replace well-designed handcrafted features in audio deepfake detection.
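The family-level comparison described above can be sketched with a two-sided Mann--Whitney U test. This is a minimal illustration using `scipy.stats.mannwhitneyu`; the per-feature accuracy values below are hypothetical placeholders, not the paper's actual results.

```python
# Hedged sketch: family-level Mann-Whitney U comparison of accuracy
# distributions, as done in the paper. The numbers are ILLUSTRATIVE ONLY.
from scipy.stats import mannwhitneyu

# Hypothetical per-feature accuracies for each family (not the FoR results).
handcrafted_acc = [0.957, 0.951, 0.948, 0.960, 0.943]  # e.g. MFCC, CQCC, RMS, ZCR, Teager
ssl_acc = [0.960, 0.955, 0.946, 0.941]                 # e.g. Wav2Vec2, HuBERT, WavLM, Whisper

stat, p_value = mannwhitneyu(handcrafted_acc, ssl_acc, alternative="two-sided")
print(f"U = {stat}, p = {p_value:.3f}")
# A large p-value (the paper reports p > 0.90) indicates no statistically
# significant difference between the two feature families.
```

The same test would be repeated for the F1 and EER distributions; since EER is a lower-is-better metric, the interpretation of the ranks differs, but the two-sided test itself is unchanged.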
Submission Number: 20