DELTA-TTS: Adapting Autoregressive Model into a Diffusion Language Model for Text-to-Speech
Keywords: Diffusion Language Model, Text-to-Speech, LoRA
TL;DR: Lightweight AR-to-dLLM conversion for TTS via LoRA and a speech-aware convolution.
Abstract: Autoregressive (AR) text-to-speech (TTS) models generate speech tokens one at a time, so inference latency scales linearly with output length. Discrete diffusion language models (dLLMs) have recently emerged as a parallel alternative that produces tokens via iterative unmasking. Recent work has converted pretrained AR language models into dLLMs for text generation, but this paradigm has not been extended to speech synthesis. Moreover, existing conversion methods require full fine-tuning and large-scale training data, which leaves open the question of whether such a conversion can be done with substantially less compute and data. We introduce DELTA-TTS, a lightweight conversion that turns a pretrained AR TTS backbone into a dLLM. The AR weights are kept frozen; all adaptation is routed through LoRA and a per-block speech-aware convolution. The convolution injects local acoustic context into the bidirectional attention, supplying short-range continuity between adjacent speech tokens. With only $585$ hours of LibriTTS as adaptation data, DELTA-TTS achieves a state-of-the-art WER of $\textbf{1.75}\%$ on Seed-TTS test-en and, in the token-generation stage, decodes $\textbf{3.3}\times$ faster than the CosyVoice3 AR backbone it is converted from. This shows that lightweight AR-to-dLLM conversion provides a practical, data- and compute-efficient route to non-autoregressive TTS.
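To illustrate why iterative unmasking decouples the number of forward passes from the output length, here is a minimal sketch of a generic confidence-based unmasking loop. It is not the DELTA-TTS decoder: the abstract does not specify the unmasking schedule, so the even-share schedule, the `MASK` sentinel, and the names `toy_predict` and `unmask_decode` are all hypothetical, and the "model" simply returns random tokens and confidences.

```python
import math
import random

MASK = -1  # hypothetical sentinel marking a still-masked speech-token position


def toy_predict(tokens, vocab_size, rng):
    """Stand-in for one dLLM forward pass (assumption, not the real model).

    Returns {position: (token, confidence)} for every masked position.
    A real model would condition on the text prompt and on the tokens
    already unmasked; here both outputs are random.
    """
    return {
        i: (rng.randrange(vocab_size), rng.random())
        for i, t in enumerate(tokens)
        if t == MASK
    }


def unmask_decode(length, num_steps, vocab_size=16, seed=0):
    """Fill `length` masked positions in at most `num_steps` parallel passes.

    At each pass, the most confident predictions are committed. The number
    committed is an even share of what remains (an assumed schedule), so the
    final pass always clears the last masked positions.
    """
    rng = random.Random(seed)
    tokens = [MASK] * length
    passes = 0
    for step in range(num_steps):
        preds = toy_predict(tokens, vocab_size, rng)
        if not preds:
            break  # everything is already unmasked
        passes += 1
        remaining_steps = num_steps - step
        k = math.ceil(len(preds) / remaining_steps)
        # Positions sorted by confidence, highest first; commit the top k.
        best = sorted(preds, key=lambda i: preds[i][1], reverse=True)[:k]
        for i in best:
            tokens[i] = preds[i][0]
    return tokens, passes
```

Under these assumptions, `unmask_decode(100, 8)` fills all 100 positions in 8 forward passes, whereas token-by-token AR decoding of the same sequence would need 100. The sketch only shows the pass count; it says nothing about per-pass cost or about the speedup reported in the abstract.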
Submission Number: 115