Keywords: text-to-speech, speech synthesis, diffusion model, continuous-valued language model, reward optimization
Abstract: Autoregressive diffusion models (ARDMs), which generate continuous latent sequences, have recently achieved state-of-the-art zero-shot text-to-speech (TTS) performance. However, fine-tuning these models with reinforcement learning (RL) to directly optimize user-defined reward functions remains an open challenge. In this work, we propose Value-Guided Policy Optimization (VGPO), an actor-critic RL algorithm tailored to ARDMs. We train a causal value model to predict expected future rewards and update the ARDM using gradients from this value model. To validate VGPO, we fine-tune the recently introduced DiTAR model and evaluate it on two tasks: improving F0 variance to enhance expressiveness, and optimizing text log-probability to improve the model's robustness to challenging long texts. VGPO achieves significant improvements in zero-shot TTS expressiveness and robustness while maintaining naturalness and speaker similarity.
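The abstract does not specify VGPO's exact objective, so the following is only a minimal PyTorch sketch of the general pattern it describes: a causal value model is regressed onto observed rewards over latent prefixes, and the generator is then updated with gradients taken through that value model. All names here (TinyARDM, CausalValueModel, reward_fn) are hypothetical stand-ins; the actual DiTAR architecture, diffusion sampler, and reward definitions are not reproduced.

```python
# Minimal actor-critic sketch for a continuous-latent autoregressive generator.
# Assumptions: Gaussian noise stands in for the per-step diffusion sampler,
# and a toy variance reward stands in for the paper's F0 / log-prob rewards.
import torch
import torch.nn as nn

LATENT_DIM, SEQ_LEN, BATCH = 16, 8, 4

class TinyARDM(nn.Module):
    """Toy actor: predicts the next continuous latent from the running prefix."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(LATENT_DIM, 64, batch_first=True)
        self.head = nn.Linear(64, LATENT_DIM)

    def rollout(self, steps):
        # Autoregressively sample a latent sequence; the reparameterized
        # Gaussian step keeps the rollout differentiable w.r.t. actor weights.
        z = torch.zeros(BATCH, 1, LATENT_DIM)
        h, outs = None, []
        for _ in range(steps):
            o, h = self.rnn(z, h)
            mu = self.head(o[:, -1:])
            z = mu + 0.1 * torch.randn_like(mu)
            outs.append(z)
        return torch.cat(outs, dim=1)  # (BATCH, steps, LATENT_DIM)

class CausalValueModel(nn.Module):
    """Critic: maps each latent prefix to a predicted expected future reward."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(LATENT_DIM, 64, batch_first=True)
        self.head = nn.Linear(64, 1)

    def forward(self, latents):
        o, _ = self.rnn(latents)
        return self.head(o).squeeze(-1)  # (BATCH, steps): one value per prefix

def reward_fn(latents):
    # Placeholder scalar reward (e.g. an F0-variance or text log-prob proxy).
    return latents.var(dim=(1, 2))

actor, critic = TinyARDM(), CausalValueModel()
opt_actor = torch.optim.Adam(actor.parameters(), lr=1e-4)
opt_critic = torch.optim.Adam(critic.parameters(), lr=1e-3)

for step in range(3):
    # 1) Critic regression: fit prefix values to the observed terminal reward.
    with torch.no_grad():
        latents = actor.rollout(SEQ_LEN)
    target = reward_fn(latents)
    value_loss = ((critic(latents) - target.unsqueeze(1)) ** 2).mean()
    opt_critic.zero_grad()
    value_loss.backward()
    opt_critic.step()

    # 2) Actor update: maximize the critic's value of freshly generated latents,
    #    letting gradients flow from the value model back into the generator.
    latents = actor.rollout(SEQ_LEN)
    actor_loss = -critic(latents)[:, -1].mean()
    opt_actor.zero_grad()
    actor_loss.backward()
    opt_actor.step()
```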
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 24842