Keywords: Singing Voice Synthesis, Reward Model, Evaluation, Data Synthesis, Reaction
Abstract: Singing voice synthesis (SVS) has advanced significantly, enabling models to generate vocals with accurate pitch and consistent style.
As these generative capabilities improve, the need for reliable evaluation and optimization becomes increasingly critical.
However, current evaluation methods such as reward models typically rely on a single numerical score, struggle to capture complex dimensions such as phrasing or expressiveness, and require costly annotations, which limits their interpretability and generalization.
To address these issues, we introduce a generative feedback (i.e., reward model) framework that outputs natural language commentaries rather than a scalar value, providing interpretable and multi-dimensional evaluation signals for SVS.
Our approach trains a reward model that generates text commentary on melody, rhythm, creativity, and overall quality, integrating audio with contextual metadata in a pretrained model to yield multi-dimensional, interpretable feedback.
Training is conducted on a complementary dataset that combines commentary generated by multimodal large language models (MLLMs) with authentic human feedback from real-world reactions, capturing both large-scale diversity and real-world evaluation patterns.
Experiments demonstrate that this framework not only improves the style consistency and expressiveness of SVS evaluation, but also delivers stronger interpretability, generalization, and diversity than conventional baselines.
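For concreteness, a minimal sketch of the evaluation interface the abstract implies: a model consumes audio plus contextual metadata and returns natural-language commentary per dimension instead of a scalar reward. All names below are hypothetical and the model call is stubbed; this is not the authors' implementation.

```python
from dataclasses import dataclass

# Dimensions named in the abstract.
DIMENSIONS = ("melody", "rhythm", "creativity", "overall")

@dataclass
class SingingFeedback:
    """Natural-language commentary per dimension, replacing a single scalar score."""
    commentary: dict[str, str]

def evaluate_singing(audio_path: str, metadata: dict) -> SingingFeedback:
    """Stub: a real system would feed the audio and contextual metadata
    into a pretrained multimodal model and decode commentary per dimension."""
    commentary = {
        dim: f"[model-generated commentary on {dim} for {audio_path}]"
        for dim in DIMENSIONS
    }
    return SingingFeedback(commentary=commentary)

if __name__ == "__main__":
    fb = evaluate_singing("sample.wav", {"lyrics": "...", "style": "pop"})
    for dim, text in fb.commentary.items():
        print(f"{dim}: {text}")
```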
Submission Number: 17