MUSIC: MUlti-Step Instruction Contrast for Multi-Turn Reward Models

ACL ARR 2026 January Submission3027 Authors

04 Jan 2026 (modified: 20 Mar 2026), ACL ARR 2026 January Submission, CC BY 4.0

Keywords: Large Language Models, Multi-turn, Reward Models

Abstract: Evaluating the quality of multi-turn conversations is crucial for developing capable Large Language Models (LLMs), yet it remains a significant challenge that often requires costly human evaluation. Multi-turn reward models (RMs) offer a scalable alternative and can provide valuable signals for guiding LLM training. While recent work has advanced multi-turn training techniques, effective automated evaluation of multi-turn interactions lags behind. We observe that standard preference datasets, which typically contrast responses only at the final conversational turn, provide insufficient signal to capture the nuances of multi-turn interactions. Instead, we find that incorporating contrasts spanning multiple turns is critical for building robust multi-turn RMs. Motivated by this finding, we propose MUlti-Step Instruction Contrast (MUSIC), an unsupervised data augmentation strategy that synthesizes contrastive conversation pairs exhibiting differences across multiple turns. Applying MUSIC to the Skywork preference dataset, we train a multi-turn RM based on the Gemma-2-9B-Instruct model. Empirical results demonstrate that our MUSIC-augmented RM outperforms baseline methods, achieving higher alignment with judgments from advanced proprietary LLM judges on multi-turn conversations without compromising performance on standard single-turn RM benchmarks.

Paper Type: Short

Research Area: Language Models

Research Area Keywords: language modeling, reinforcement learning

Languages Studied: English

Submission Number: 3027