Keywords: Speech, Watermark, Text-To-Speech Models, Benchmark, Voice Cloning
TL;DR: The first benchmark for evaluating in-processing LLM watermarks on Text-to-Speech synthesis.
Abstract: The rise of large language model (LLM)-based text-to-speech (TTS) synthesis has enabled unprecedented voice cloning capabilities, calling for robust content governance. In-processing watermarking, which embeds watermarks during generation, has proven effective for text and images.
An immediate research question is whether in-processing watermarks can be adapted to LLM-based TTS models, which similarly generate discrete tokens before synthesis. However, their transferability to speech, in terms of both quality and robustness, remains a critical yet unverified question. We present SpeechWakBench, the first large-scale benchmark to systematically evaluate the transferability of in-processing watermarking from LLMs to speech synthesis. SpeechWakBench evaluates 6 adapted in-processing LLM watermarking methods against 4 post-processing audio watermarking baselines across 3 modern LLM-based TTS models, using 16 reference-free quality metrics and a unified detectability metric under 10 attacks. Our results show that while in-processing watermarking produces slightly higher speech quality, it fails catastrophically in robustness, performing substantially worse than post-processing methods. We demonstrate that this failure is systemic, caused by the irreversible token-to-waveform conversion. This fundamental limitation highlights opportunities for developing novel watermarking approaches tailored to the unique challenges of speech synthesis.
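For concreteness, the sketch below illustrates what "in-processing watermarking on discrete speech tokens" could look like: a KGW-style green-list watermark (Kirchenbauer et al.) applied at the TTS model's token-sampling step, with detection via a z-score over green-token counts. This is a minimal illustrative sketch under assumed parameters (codebook size, bias strength); all names are hypothetical and not the benchmark's actual implementation. Note that detection must recover the token sequence from the waveform, which is exactly the irreversible step the abstract identifies as the systemic failure point.

```python
# Minimal sketch: KGW-style green-list watermark applied to discrete
# speech-token sampling in an LLM-based TTS model. Hypothetical names
# and parameters throughout; not the benchmark's implementation.
import math
import numpy as np

VOCAB_SIZE = 1024   # assumed size of the speech-token codebook
GAMMA = 0.5         # fraction of the vocabulary on the green list
DELTA = 2.0         # logit bias added to green tokens

def green_list(prev_token: int) -> np.ndarray:
    """Pseudo-randomly partition the vocabulary, seeded by the previous token."""
    rng = np.random.default_rng(seed=prev_token)
    return rng.permutation(VOCAB_SIZE)[: int(GAMMA * VOCAB_SIZE)]

def watermarked_sample(logits: np.ndarray, prev_token: int,
                       rng: np.random.Generator) -> int:
    """Bias logits toward the green list, then sample the next speech token."""
    biased = logits.copy()
    biased[green_list(prev_token)] += DELTA
    probs = np.exp(biased - biased.max())
    probs /= probs.sum()
    return int(rng.choice(VOCAB_SIZE, p=probs))

def detect(tokens: list[int]) -> float:
    """z-score of green-token hits; a high z suggests a watermark is present.

    Detection presupposes recovering the token sequence from the waveform,
    i.e. inverting the token-to-waveform conversion, which the abstract
    argues is irreversible after synthesis.
    """
    hits = sum(t in set(green_list(p)) for p, t in zip(tokens, tokens[1:]))
    n = len(tokens) - 1
    return (hits - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))
```

In this framing, post-processing methods instead embed the watermark directly in the waveform after synthesis, which is why they survive the conversion step that defeats the token-level scheme above.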
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 8913