Silence-the-Mimic: Accelerating Imperceptible Perturbations Against Voice Cloning

ICLR 2026 Conference Submission 20717 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Adversarial Defense, Audio Security, Voice Cloning, Voice Conversion, Text-to-Speech
Abstract: Deep neural network–based Voice Conversion (VC) and Text-to-Speech (TTS) models have advanced rapidly, enabling realistic voice cloning from minimal input data. Such capabilities raise serious concerns about unauthorized cloning of speaker identities and the associated privacy and security risks. Existing imperceptible adversarial protection methods rely on quality-control losses that are highly sensitive to hyperparameter tuning, and their lengthy optimization makes them computationally expensive. To address these limitations, we propose a fast yet imperceptible protection method that injects perturbations in the frequency domain under a psychoacoustic masking–based constraint. Our approach strictly enforces perceptibility bounds during adversarial training, eliminating the need for iterative quality balancing and significantly reducing computational cost. Experiments on multiple state-of-the-art VC and TTS models show that our method matches or exceeds the protection performance of existing baselines while running at least an order of magnitude faster. These results establish perceptually constrained frequency-domain perturbation as a practical paradigm for protection against voice cloning.
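The abstract describes the method only at a high level. The sketch below is a hypothetical illustration (not the authors' implementation) of the general idea: optimize a perturbation directly in the STFT domain and, after every gradient step, project it back inside a per-bin perceptibility bound rather than balancing a tunable quality loss. Here `speaker_encoder` stands in for any waveform-to-embedding model used by a cloning pipeline, and `masking_threshold` with its -20 dB margin is a crude stand-in for a real psychoacoustic masking model; all of these names and values are assumptions.

```python
# Hypothetical sketch of hard-constrained frequency-domain perturbation.
# Not the paper's code: speaker_encoder and masking_threshold are placeholders.
import torch
import torch.nn.functional as F


def masking_threshold(spec_mag: torch.Tensor, margin_db: float = -20.0) -> torch.Tensor:
    """Per-bin cap on perturbation magnitude, margin_db below the signal's own energy."""
    return spec_mag * (10.0 ** (margin_db / 20.0))


def protect(waveform: torch.Tensor, speaker_encoder, n_fft: int = 1024,
            hop: int = 256, steps: int = 50, lr: float = 1e-2) -> torch.Tensor:
    window = torch.hann_window(n_fft)
    spec = torch.stft(waveform, n_fft, hop_length=hop, window=window,
                      return_complex=True)
    bound = masking_threshold(spec.abs())  # hard perceptibility bound, fixed up front

    # Perturbation parameters: real/imaginary parts of a complex STFT offset.
    delta_re = torch.zeros_like(spec.real, requires_grad=True)
    delta_im = torch.zeros_like(spec.imag, requires_grad=True)
    opt = torch.optim.Adam([delta_re, delta_im], lr=lr)

    clean_emb = speaker_encoder(waveform).detach()

    for _ in range(steps):
        adv_spec = spec + torch.complex(delta_re, delta_im)
        adv_wav = torch.istft(adv_spec, n_fft, hop_length=hop, window=window,
                              length=waveform.shape[-1])
        # Minimizing cosine similarity pushes the protected audio's speaker
        # embedding away from the original speaker identity.
        loss = F.cosine_similarity(speaker_encoder(adv_wav), clean_emb, dim=-1).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

        # Hard projection: rescale any bin whose perturbation exceeds the
        # masking bound. No quality-loss weight to tune or re-balance.
        with torch.no_grad():
            mag = torch.sqrt(delta_re ** 2 + delta_im ** 2).clamp_min(1e-12)
            scale = torch.clamp(bound / mag, max=1.0)
            delta_re.mul_(scale)
            delta_im.mul_(scale)

    with torch.no_grad():
        final_spec = spec + torch.complex(delta_re, delta_im)
        return torch.istft(final_spec, n_fft, hop_length=hop, window=window,
                           length=waveform.shape[-1])
```

Under this reading, the hard per-bin projection is what "strictly enforces perceptibility bounds": imperceptibility is guaranteed by construction rather than traded off through a weighted loss, so the optimization can stop after a fixed, small number of steps, which is consistent with the claimed speedup.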
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 20717