Abstract: In Reinforcement Learning from Human Feedback (RLHF), the reward model (RM) evaluates the response quality based on the given context and assigns a reward.
It plays a crucial role in aligning models with human preferences through RLHF.
The current RM training paradigm concatenates the context and response and maximizes the reward gap between preferred and rejected responses. Under this paradigm, we demonstrate that the RM faces two significant issues: i) it often allocates only a small proportion of its attention to the context, and ii) it frequently ignores segments of the context that are relevant for evaluating response quality.
These issues undermine the RM's effectiveness in modeling human preferences.
To address these challenges, we propose AttnRM, a novel optimization framework that guides the RM to concentrate on crucial segments of the context.
Experimental results demonstrate that AttnRM significantly improves preference modeling by increasing attention to relevant information within the context.
It also enhances the RM's generalizability and achieves better performance in aligning with human preferences.
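For reference, the training paradigm described in the abstract typically optimizes a Bradley-Terry-style pairwise objective over concatenated (context, response) sequences. The sketch below is a minimal illustration in PyTorch, not the paper's method; the names `pairwise_rm_loss` and `rm` are hypothetical.

```python
import torch
import torch.nn.functional as F

def pairwise_rm_loss(reward_chosen: torch.Tensor,
                     reward_rejected: torch.Tensor) -> torch.Tensor:
    """Standard Bradley-Terry pairwise loss: pushes the reward of the
    preferred (chosen) response above that of the rejected one."""
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Hypothetical usage: `rm` scores a concatenated (context, response)
# sequence with a single scalar per example.
# rewards_chosen = rm(context_and_chosen)      # shape: (batch,)
# rewards_rejected = rm(context_and_rejected)  # shape: (batch,)
# loss = pairwise_rm_loss(rewards_chosen, rewards_rejected)
```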
Paper Type: Long
Research Area: Special Theme (conference specific)
Research Area Keywords: generalization, model bias/unfairness mitigation, reinforcement learning
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models, Data analysis
Languages Studied: English
Submission Number: 2945