DurMI: Duration Loss as a Membership Signal in TTS Models

ICLR 2026 Conference Submission 17840 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Text-to-Speech, Membership Inference Attack, privacy, security
TL;DR: We propose DurMI, a fast and accurate white-box membership inference attack for Text-to-Speech models that leverages duration loss, achieving up to 100× speedup over prior methods.
Abstract: Text-to-speech (TTS) models such as FastSpeech2, Grad-TTS, and VITS2 achieve state-of-the-art quality but risk memorizing and leaking sensitive training data. Existing membership inference attacks (MIAs) for diffusion-based TTS models typically rely on denoising errors, which are costly to compute and weak at capturing sample-specific memorization. We introduce DurMI, the first membership inference attack that exploits duration loss, a core alignment signal in TTS models, as a discriminative indicator of membership. Duration loss captures the model's tendency to overfit alignment targets, whether those targets come from deterministic aligners such as MAS and MFA or from stochastic duration predictors as in VITS2. Leveraging this signal, DurMI enables accurate inference with a single forward pass. Experiments on three benchmarks across diverse architectures, including diffusion (Grad-TTS, WaveGrad2), flow-matching (VoiceFlow), transformer (FastSpeech2), and stochastic-duration (VITS2) models, show that DurMI consistently outperforms prior MIAs, including on waveform-level synthesis where existing attacks fail. These results highlight DurMI's effectiveness, efficiency, and broad applicability, underscoring the need for privacy-preserving training in modern TTS systems.
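The abstract describes the attack as scoring a candidate sample's duration loss from a single forward pass and thresholding it. A minimal sketch of that decision rule, assuming an MSE loss over log-durations (all names and the threshold value are hypothetical illustrations, not the paper's implementation):

```python
import numpy as np

def duration_loss(pred_log_durations, target_log_durations):
    # MSE between predicted log-durations and the aligner-derived targets:
    # the per-sample signal the attack thresholds on.
    pred = np.asarray(pred_log_durations, dtype=float)
    tgt = np.asarray(target_log_durations, dtype=float)
    return float(np.mean((pred - tgt) ** 2))

def predict_member(loss, threshold):
    # Training members tend to overfit their alignment targets, so a
    # duration loss below a calibrated threshold is flagged as "member".
    return loss < threshold

# Toy example: a memorized sample tracks its targets closely,
# a held-out sample does not.
targets = [0.10, 0.50, 0.30]
member_loss = duration_loss([0.11, 0.52, 0.29], targets)
nonmember_loss = duration_loss([0.40, 0.10, 0.70], targets)
```

The threshold would in practice be calibrated on auxiliary data; the key point is that, unlike denoising-error attacks, the score needs only one forward pass through the duration predictor.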
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 17840