Keywords: reinforcement learning, reinforcement learning from human feedback, Boltzmann rational model, preference learning
Abstract: Learning utilities from preference feedback has become increasingly important, particularly in fine-tuning language models such as ChatGPT.
Traditional methods often assume equal rationality among labellers, leading to inaccurate utility estimates.
We propose an algorithm that jointly estimates trainer rationality and item utilities to enhance utility learning and gain additional insights from feedback.
Our approach focuses on settings where feedback is received from multiple trainers,
using the Boltzmann-rational model to relate choices to latent utilities while accounting for varying levels of rationality.
Given shared utilities, our method identifies rationality ratios among trainers from observed choices without extra calibration data or assumptions.
We analyse the theoretical impact of assuming equal rationality on utility accuracy and empirically show superior performance in an action-advice setting, where agents construct policies using the learned utilities as rewards.
By accurately modelling trainer rationality, our approach can improve the collection of high-quality feedback, potentially leading to better-aligned models and an improved understanding of human preferences.
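
To illustrate the Boltzmann-rational model mentioned in the abstract, the following is a minimal sketch, assuming the standard pairwise-choice form in which a trainer with rationality coefficient beta prefers item a over item b with probability sigmoid(beta * (u_a - u_b)). The abstract does not state the exact parameterisation, so the function name, the utilities, and the beta values below are hypothetical. The sketch also shows why, under this assumed form, choices constrain only the ratios between trainers' rationalities once utilities are shared: rescaling every beta by a constant and every utility by its inverse leaves all choice probabilities unchanged.

```python
import math


def boltzmann_choice_prob(u_a: float, u_b: float, beta: float) -> float:
    """Probability that a trainer with rationality `beta` prefers a over b.

    Assumed form (not taken from the paper): sigmoid(beta * (u_a - u_b)).
    beta = 0 gives uniformly random choices; larger beta gives choices
    that track the utility difference more reliably.
    """
    return 1.0 / (1.0 + math.exp(-beta * (u_a - u_b)))


# Hypothetical shared utilities and per-trainer rationality coefficients.
u_a, u_b = 1.0, 0.0
betas = {"trainer_1": 2.0, "trainer_2": 0.5}

probs = {name: boltzmann_choice_prob(u_a, u_b, beta) for name, beta in betas.items()}
for name, p in probs.items():
    print(f"{name}: P(a preferred over b) = {p:.3f}")

# Scale ambiguity: multiply every beta by c and divide every utility by c.
# The product beta * (u_a - u_b) is unchanged, so the probabilities are too.
c = 4.0
p_original = boltzmann_choice_prob(u_a, u_b, betas["trainer_1"])
p_rescaled = boltzmann_choice_prob(u_a / c, u_b / c, betas["trainer_1"] * c)
print(f"original: {p_original:.6f}, rescaled: {p_rescaled:.6f}")
```

In this sketch the more rational trainer chooses the higher-utility item more often, and the rescaled parameters reproduce the original probability, which is why an absolute rationality scale cannot be read off from choices alone under this assumed form.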
Submission Number: 78