Abstract: Speech generation has achieved significant advances through flow-matching techniques, and reinforcement learning from human feedback (RLHF) has shown the potential to further improve speech generation models. In this work, we propose F5D-TTS, a framework for aligning flow-matching text-to-speech (TTS) models with human preferences by directly optimizing on human preference data pairs. Specifically, we first use the base F5-TTS model to create a preference dataset of 10,000 speech pairs (about 2 hours), each consisting of a winner utterance and a loser utterance generated from the same prompt. We then align the flow-matching TTS model with Flow-DPO. Experiments show that F5D-TTS significantly outperforms both the base F5-TTS model and a supervised fine-tuned F5-TTS model in speaker similarity (measured by SIM-O) while maintaining speech intelligibility (measured by WER) and speech naturalness (measured by UTMOS). We also show that Flow-DPO alignment is applicable in low-resource scenarios.
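The alignment step the abstract describes can be illustrated with a minimal sketch. The version below assumes the negative conditional flow-matching (CFM) loss is used as the log-likelihood surrogate inside the DPO objective, as in diffusion-DPO; all names and signatures (`cfm_loss`, `policy`, `ref_model`, `beta`) are hypothetical illustrations, not the paper's actual implementation.

```python
# Minimal PyTorch sketch of one Flow-DPO step on a (winner, loser)
# speech pair sharing the same prompt. Hypothetical API, for intuition.
import torch
import torch.nn.functional as F

def cfm_loss(model, x1, cond, t, x0):
    """Per-sample conditional flow-matching loss.

    x1: target speech features, x0: noise sample, t in (0, 1),
    cond: text/prompt conditioning. The model predicts the velocity
    field; the target is (x1 - x0) under a linear probability path.
    """
    xt = (1 - t) * x0 + t * x1           # point on the linear path
    v_pred = model(xt, t, cond)          # predicted velocity (assumed signature)
    v_target = x1 - x0                   # ground-truth velocity
    return ((v_pred - v_target) ** 2).flatten(1).mean(dim=1)

def flow_dpo_loss(policy, ref_model, win, lose, cond, beta=0.1):
    """DPO loss with -CFM loss as the implicit log-likelihood."""
    b = win.shape[0]
    t = torch.rand(b, 1, 1, device=win.device)   # shared timestep per pair
    x0 = torch.randn_like(win)                   # shared noise per pair

    # CFM losses of the trainable policy and the frozen reference.
    lw_pol = cfm_loss(policy, win, cond, t, x0)
    ll_pol = cfm_loss(policy, lose, cond, t, x0)
    with torch.no_grad():
        lw_ref = cfm_loss(ref_model, win, cond, t, x0)
        ll_ref = cfm_loss(ref_model, lose, cond, t, x0)

    # Lower CFM loss ~ higher likelihood, so the preference margin is
    # the negated difference of (policy - reference) loss gaps.
    margin = -beta * ((lw_pol - lw_ref) - (ll_pol - ll_ref))
    return -F.logsigmoid(margin).mean()
```

Sharing the timestep and noise between the winner and loser keeps the comparison paired, so the loss rewards the policy for reducing its flow-matching error on the preferred sample relative to the reference model more than on the rejected one.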
Paper Type: Long
Research Area: Speech Recognition, Text-to-Speech and Spoken Language Understanding
Research Area Keywords: speech technologies
Contribution Types: Approaches to low-resource settings, Approaches to low-compute settings-efficiency, Theory
Languages Studied: English, Chinese
Submission Number: 2151