Abstract: In Reinforcement Learning from Human Feedback (RLHF), the reward model (RM) evaluates the response quality based on the given context and assigns a reward.
It plays a crucial role in aligning models with human preferences through RLHF.
The current RM training paradigm concatenates the context and response and maximizes the reward gap between preferred and rejected responses. Under this paradigm, we demonstrate that the RM faces two significant issues: i) it often allocates only a small proportion of its attention to the context, and ii) it frequently ignores segments of the context that are relevant for evaluating response quality.
These issues undermine the RM's effectiveness in modeling human preferences.
To address these challenges, we propose AttnRM, a novel optimization framework that guides the RM to concentrate on crucial segments of the context.
Experimental results demonstrate that AttnRM significantly improves preference modeling by increasing attention to relevant information within the context.
It also enhances the RM's generalizability and achieves better performance in aligning with human preferences.
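For reference, the training paradigm described in the abstract typically optimizes a Bradley-Terry-style pairwise objective over concatenated (context, response) sequences. The sketch below is a minimal illustration in PyTorch, not the paper's method; the names `pairwise_rm_loss` and `rm` are hypothetical.

```python
import torch
import torch.nn.functional as F

def pairwise_rm_loss(reward_chosen: torch.Tensor,
                     reward_rejected: torch.Tensor) -> torch.Tensor:
    """Standard Bradley-Terry pairwise loss: pushes the reward of the
    preferred (chosen) response above that of the rejected one."""
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Hypothetical usage: `rm` scores a concatenated (context, response)
# sequence with a single scalar per example.
# rewards_chosen = rm(context_and_chosen)      # shape: (batch,)
# rewards_rejected = rm(context_and_rejected)  # shape: (batch,)
# loss = pairwise_rm_loss(rewards_chosen, rewards_rejected)
```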
Paper Type: Long
Research Area: Special Theme (conference specific)
Research Area Keywords: generalization, model bias/unfairness mitigation, reinforcement learning
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models, Data analysis
Languages Studied: English
Submission Number: 2945