Effect of Data Format on Reward Signals for LLM Instruction Tuning

ACL ARR 2025 May Submission7960 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Finetuning LLMs with reinforcement learning relies on preference data for reward model training and/or direct optimisation of a supervised fine-tuned policy. While significant effort has gone into model architectures, RL algorithms, and pipeline engineering, the input to this pipeline has largely remained unchanged: pairwise preferences. We argue that with the rise of AI-based synthetic labelling, the cost-efficiency of binary preferences should no longer be the deciding factor in data acquisition. In this paper, we study how annotation modality impacts the reward signal for reward model training and implicit-reward-model instruction-tuning. Starting from an existing dataset with multiple completions from different chat models, we construct four new synthetic datasets, one for each annotation modality: Binary, BinaryMagn, Ranking, Single. We measure the impact of modality on the preference data itself and on the downstream reward signal by training reward models and DPO-tuned policies with each format across five models of different sizes and families. We find that changing the input format significantly impacts the outcomes. In particular, ranking-based preference annotation consistently outperforms alternatives for both reward modeling and instruction-tuning from preferences, across model scales. The improvement of ranking over binary preferences is less noticeable for smaller reward models but becomes more significant as model capacity increases.
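To make the four annotation modalities concrete, the following is a minimal Python sketch of what a record under each format might look like; the field names and score ranges are illustrative assumptions for exposition, not the authors' actual dataset schema.

    # Hypothetical record schemas for the four annotation modalities named in
    # the abstract (Binary, BinaryMagn, Ranking, Single). All field names and
    # value ranges are assumptions for illustration only.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class BinaryAnnotation:
        # Which of two completions is preferred for a prompt.
        prompt: str
        chosen: str
        rejected: str

    @dataclass
    class BinaryMagnAnnotation(BinaryAnnotation):
        # Binary preference plus a magnitude of how much better the chosen one is.
        magnitude: int = 1  # e.g. 1 (slightly better) .. 3 (much better)

    @dataclass
    class RankingAnnotation:
        # A full ordering over all completions for the same prompt.
        prompt: str
        completions: List[str] = field(default_factory=list)
        ranking: List[int] = field(default_factory=list)  # indices, best first

    @dataclass
    class SingleAnnotation:
        # An absolute quality score assigned to a single completion in isolation.
        prompt: str
        completion: str
        score: float = 0.0  # e.g. a 1-10 rating

Under this reading, Binary and BinaryMagn compare two completions at a time, Ranking orders the whole completion set jointly, and Single scores each completion independently without any comparison.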
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: corpus creation,automatic creation and evaluation of language resources,NLP datasets,automatic evaluation of datasets, metrics,data influence
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 7960