Abstract: Voice is one of the most widely used media for information transmission in human society. While high-quality synthetic voices are used extensively in many applications, they pose significant risks to content security and trust building. Numerous studies have concentrated on fake voice detection to mitigate these risks, and many claim promising performance. However, recent research has demonstrated that existing fake voice detectors overfit severely to speaker-irrelative features (SiFs) and cannot be used in real-world scenarios. In this paper, we analyze the limitations of existing fake voice detectors and propose a new design philosophy that guides the detection model to prioritize learning human voice features rather than the difference between human and synthetic voices. Based on this philosophy, we propose a novel fake voice detection framework named SiFSafer, which uses pre-trained speech representation models to enhance learning of the feature distribution of human voices and adapter fine-tuning to optimize performance. The evaluation shows that the average equal error rates (EERs) of existing fake voice detectors from the ASVspoof challenge can exceed 20% when SiFs such as silence segments are removed, while SiFSafer achieves an EER of less than 8%, indicating that SiFSafer is robust to SiFs and strongly resistant to existing attacks.
Primary Subject Area: [Generation] Social Aspects of Generative AI
Secondary Subject Area: [Generation] Generative Multimedia
Relevance To Conference: This work addresses fake voice detection, which is crucial for ensuring content security and trust in multimedia applications. By proposing SiFSafer, a novel framework that prioritizes learning human voice features, this study significantly advances the field of multimedia processing. SiFSafer effectively mitigates the overfitting problem inherent in existing detectors and thereby enhances voice content security. This contribution provides a more effective solution for detecting fake voices in multimedia content, bolstering trust and reliability in various multimedia applications.
Submission Number: 2554