VGPO: Fine-Tuning Speech Autoregressive Diffusion Models with Value Guided Policy Optimization

ICLR 2026 Conference Submission 24842 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: text-to-speech, speech synthesis, diffusion model, continuous-valued language model, reward optimization
Abstract: Autoregressive diffusion models (ARDMs), which generate continuous latent sequences, have recently achieved state-of-the-art zero-shot text-to-speech (TTS) performance. However, fine-tuning these models with reinforcement learning (RL) to directly optimize user-defined reward functions remains an open challenge. In this work, we propose Value-Guided Policy Optimization (VGPO), an actor-critic RL algorithm tailored to ARDMs. We train a causal value model to predict expected future rewards and update the ARDM using gradients from this value model. To validate VGPO, we fine-tune the recently introduced DiTAR model and evaluate it on two tasks: increasing F0 variance to enhance expressiveness, and optimizing text log-probability to improve the model's robustness on challenging long texts. VGPO achieves significant improvements in zero-shot TTS expressiveness and robustness while maintaining naturalness and speaker similarity.
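To illustrate the actor-critic idea described in the abstract, below is a minimal conceptual sketch in PyTorch: a causal value model is regressed toward the observed reward at every prefix of the generated latent sequence, and the generator is then updated by ascending the critic's value through differentiable (reparameterized) generation. All names here (ToyARDM, CausalValueModel, reward_fn) are hypothetical stand-ins, not the paper's DiTAR or VGPO implementation, and the placeholder reward only loosely mirrors the F0-variance objective.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins; the actual DiTAR/VGPO components are not reproduced here.
class ToyARDM(nn.Module):
    """Toy autoregressive generator emitting continuous latents step by step."""
    def __init__(self, dim=16):
        super().__init__()
        self.rnn = nn.GRUCell(dim, dim)
        self.head = nn.Linear(dim, dim)

    def generate(self, steps, dim=16):
        h = torch.zeros(1, dim)
        x = torch.zeros(1, dim)
        latents = []
        for _ in range(steps):
            h = self.rnn(x, h)
            # Reparameterized sample keeps latents differentiable w.r.t. policy params.
            mean = self.head(h)
            x = mean + 0.1 * torch.randn_like(mean)
            latents.append(x)
        return torch.stack(latents, dim=1)  # (1, steps, dim)

class CausalValueModel(nn.Module):
    """Causal critic: predicts expected future reward from each latent prefix."""
    def __init__(self, dim=16):
        super().__init__()
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.v = nn.Linear(dim, 1)

    def forward(self, latents):
        h, _ = self.rnn(latents)
        return self.v(h).squeeze(-1)  # (1, steps)

def reward_fn(latents):
    # Placeholder reward: latent variance as a crude proxy for "expressiveness".
    return latents.var(dim=(1, 2))

policy, critic = ToyARDM(), CausalValueModel()
opt_pi = torch.optim.Adam(policy.parameters(), lr=1e-4)
opt_v = torch.optim.Adam(critic.parameters(), lr=1e-3)

for step in range(100):
    # 1) Critic update: regress per-step values toward the observed final reward.
    with torch.no_grad():
        latents = policy.generate(steps=20)
        r = reward_fn(latents)
    values = critic(latents)
    v_loss = ((values - r.unsqueeze(-1)) ** 2).mean()
    opt_v.zero_grad(); v_loss.backward(); opt_v.step()

    # 2) Policy update: ascend the critic's predicted value through generation.
    latents = policy.generate(steps=20)
    pi_loss = -critic(latents).mean()
    opt_pi.zero_grad(); pi_loss.backward(); opt_pi.step()
```

The sketch assumes the reward can be approximated by a learned critic over latent prefixes and that generation is differentiable end to end; the paper's actual training objective, value-model architecture, and reward definitions may differ.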
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 24842