Audio Deepfake Detection with Self-Supervised XLS-R and SLS Classifier

Published: 20 Jul 2024, Last Modified: 04 Aug 2024 · MM 2024 Poster · CC BY 4.0
Abstract: Generative AI technologies, including text-to-speech (TTS) and voice conversion (VC), increasingly produce speech that is indistinguishable from genuine samples, making it difficult for listeners to discern real from synthetic content. This indistinguishability undermines trust in media, and the arbitrary cloning of personal voice signals poses significant challenges to privacy and security. In the field of deepfake audio detection, the majority of models achieving high detection accuracy currently employ self-supervised pre-trained models. However, as deepfake audio generation algorithms continue to evolve, maintaining high discrimination accuracy against new algorithms grows more challenging. To enhance sensitivity to deepfake audio features, we propose a deepfake audio detection model that incorporates an SLS (Sensitive Layer Selection) module. Specifically, using the pre-trained XLS-R, our model extracts diverse audio features from its various layers, each providing distinct discriminative information. With the SLS classifier, our model captures sensitive contextual information across the different layer levels of these audio features and effectively exploits it for fake audio detection. Experimental results show that our method achieves state-of-the-art (SOTA) performance on both the ASVspoof 2021 DF and In-the-Wild datasets, with an Equal Error Rate (EER) of 1.92% on the ASVspoof 2021 DF dataset and 7.46% on the In-the-Wild dataset. Code and data can be found at https://github.com/QiShanZhang/SLSforADD.
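The core idea described above, fusing hidden states from every layer of a pre-trained encoder with learned per-layer weights before classification, can be sketched as follows. This is a minimal illustration, not the paper's actual SLS implementation (see the linked repository for that): the `LayerSelectionHead` class, the softmax layer-weighting scheme, and the random stand-in tensors for XLS-R hidden states are all assumptions made here for clarity.

```python
import torch
import torch.nn as nn

class LayerSelectionHead(nn.Module):
    """Illustrative sketch (not the paper's exact SLS module): learn a
    sensitivity weight per encoder layer, fuse the per-layer features by
    a softmax-weighted sum, then classify real vs. fake."""

    def __init__(self, num_layers: int, dim: int, num_classes: int = 2):
        super().__init__()
        # One learnable logit per encoder layer.
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (num_layers, batch, time, dim)
        w = torch.softmax(self.layer_logits, dim=0)            # weights sum to 1
        fused = torch.einsum("l,lbtd->btd", w, hidden_states)  # weighted layer sum
        pooled = fused.mean(dim=1)                             # average over time
        return self.classifier(pooled)                         # per-utterance logits

# Stand-in for XLS-R (300M) hidden states: 25 states (embedding + 24 blocks),
# batch of 2 utterances, 50 frames, feature dim 1024.
states = torch.randn(25, 2, 50, 1024)
head = LayerSelectionHead(num_layers=25, dim=1024)
logits = head(states)
print(logits.shape)  # torch.Size([2, 2])
```

Training such a head end-to-end lets the model discover which encoder layers carry the most spoofing-sensitive information, rather than fixing a single output layer in advance.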
Primary Subject Area: [Generation] Social Aspects of Generative AI
Secondary Subject Area: [Generation] Social Aspects of Generative AI
Relevance To Conference: This work contributes to the field of multimedia and multimodal processing by addressing the growing challenge of detecting deepfake audio content, a pressing issue in the era of generative AI technologies. With advances in text-to-speech (TTS) and voice conversion (VC), distinguishing real from synthetic audio has become increasingly difficult, undermining trust in media and posing risks to privacy and security. Our proposed model leverages a novel Sensitive Layer Selection (SLS) module in conjunction with the pre-trained XLS-R framework to enhance the sensitivity of deepfake audio feature detection. By extracting diverse audio features across various layers and capturing sensitive contextual information, our approach not only advances the state of the art in deepfake audio detection but also addresses critical concerns in multimedia authenticity and security. Experimental results demonstrating superior performance on prominent datasets such as ASVspoof 2021 DF and In-the-Wild underline the efficacy of our method in discriminating between real and synthetic audio. Hence, this work is directly aligned with the core interests of the ACM MM community, offering innovative solutions for safeguarding multimedia content integrity and contributing to the broader discourse on multimedia and multimodal processing in the face of evolving digital threats.
Supplementary Material: zip
Submission Number: 3501