Enkidu: Universal Frequential Perturbation for Real-Time Audio Privacy Protection against Voice Deepfakes
Abstract: The rise of advanced voice deepfake technologies has raised serious concerns over user audio privacy, as malicious actors increasingly exploit publicly available voice data to generate convincing fake audio for purposes such as identity theft, financial fraud, and misinformation campaigns. While existing defense methods offer partial protection, they suffer from critical limitations, including weak adaptability to unseen user data, poor scalability to long audio, rigid reliance on white-box knowledge, and high computational and temporal costs in the encryption process. Therefore, to defend against personalized voice deepfake threats, we propose Enkidu, a novel user-oriented privacy-preserving framework that leverages universal frequential perturbations generated through black-box knowledge and few-shot training on a small number of user samples. These highly malleable frequency-domain noise patches enable real-time, lightweight protection with strong generalization across variable-length audio and robust resistance against voice deepfake attacks, all while preserving high perceptual quality and intelligibility. Notably, Enkidu improves processing memory efficiency by over 50-200× (requiring only 0.004 GB) and runtime efficiency by over 3-7000× (real-time coefficient as low as 0.004) compared to six SOTA countermeasures. Extensive experiments across six mainstream Text-to-Speech (TTS) models and five cutting-edge Automated Speaker Verification (ASV) models demonstrate the effectiveness, transferability, and practicality of Enkidu in defending against voice deepfakes and adaptive attacks.
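The abstract does not spell out how a fixed noise patch is applied to audio of arbitrary length. The following is a minimal sketch of one plausible reading, not the authors' implementation: it assumes the patch is a small complex array in the short-time Fourier domain that is tiled along the time axis and added to every frame. The function name `apply_universal_patch`, the patch shape, the STFT parameters `n_fft` and `hop`, and the random patch itself are all hypothetical; in Enkidu the patch would be learned from a few user samples.

```python
import numpy as np

def apply_universal_patch(audio, patch, n_fft=512, hop=128):
    """Tile a fixed frequency-domain patch over the STFT of `audio`.

    Assumption (not from the paper): `patch` is a complex array of shape
    (patch_frames, n_fft // 2 + 1) that is simply added to the spectrum.
    """
    window = np.hanning(n_fft)
    pad = n_fft  # pad both ends so every real sample gets full frame overlap
    n_frames = 1 + int(np.ceil((len(audio) + 2 * pad - n_fft) / hop))
    total = n_fft + (n_frames - 1) * hop
    padded = np.zeros(total)
    padded[pad:pad + len(audio)] = audio

    # Forward STFT: slice overlapping frames, window them, take the real FFT.
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    spec = np.fft.rfft(padded[idx] * window, axis=1)

    # Repeat the same patch along time until it covers all frames, so one
    # patch serves audio of any length.
    reps = int(np.ceil(n_frames / patch.shape[0]))
    spec = spec + np.tile(patch, (reps, 1))[:n_frames]

    # Inverse STFT with windowed overlap-add and window-energy normalization.
    frames = np.fft.irfft(spec, n=n_fft, axis=1) * window
    out = np.zeros(total)
    norm = np.zeros(total)
    for i in range(n_frames):
        out[i * hop:i * hop + n_fft] += frames[i]
        norm[i * hop:i * hop + n_fft] += window ** 2
    out /= np.maximum(norm, 1e-12)
    return out[pad:pad + len(audio)]

# Hypothetical usage: a random stand-in for a trained patch, applied to
# one second of noise sampled at 16 kHz.
rng = np.random.default_rng(0)
demo_patch = 0.01 * (rng.standard_normal((8, 257))
                     + 1j * rng.standard_normal((8, 257)))
protected = apply_universal_patch(rng.standard_normal(16000), demo_patch)
```

Because the patch is independent of the input, applying it costs only one forward and one inverse transform per clip, which is consistent with the lightweight, real-time protection the abstract describes. How the patch is actually trained and constrained for perceptual quality is not covered by this sketch.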
External IDs: doi:10.1145/3746027.3755629