Keywords: Reward Modeling, Response Time, RLHF, Preference Learning, Preference Strength, Learning from Rankings, Data Efficiency
TL;DR: ResponseRank enables data-efficient learning of distance-aware reward models through stratified comparison strength rankings.
Abstract: Binary choices, as often used for reinforcement learning from human feedback (RLHF), convey only the *direction* of a preference.
A person may choose apples over oranges and bananas over grapes, but *which preference is stronger*?
Strength is crucial for decision-making under uncertainty and for the generalization of preference models, but it is hard to measure reliably.
Metadata such as response times and inter-annotator agreement can serve as proxies for strength, but these signals are often noisy and confounded.
We propose ResponseRank to address the challenge of learning from noisy strength signals.
Our method uses relative differences in these signals to *rank responses to pairwise comparisons by their inferred preference strength*.
Signals are considered only locally, within carefully constructed strata, controlling for systematic variation.
This enables robust learning of utility differences consistent with strength-derived rankings, all while making minimal assumptions.
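To make the mechanism concrete, here is a minimal sketch (in PyTorch) under illustrative assumptions: response time serves as the strength proxy, faster answers are treated as stronger preferences within a stratum, and a softplus penalty encourages predicted utility gaps to respect that ordering. The function names and loss form are hypothetical, not the paper's exact formulation.

```python
# Minimal, hypothetical sketch of a within-stratum gap-ranking loss.
# Assumptions (not from the abstract itself): response time is the strength
# proxy, and faster answers indicate stronger preferences.
import torch
import torch.nn.functional as F

def stratified_gap_ranking_loss(reward_model, strata):
    """strata: list of strata; each stratum is a list of comparisons
    (chosen, rejected, response_time) collected under comparable conditions."""
    losses = []
    for stratum in strata:
        # Predicted utility gap for each comparison in this stratum.
        gaps = [reward_model(chosen) - reward_model(rejected)
                for chosen, rejected, _ in stratum]
        times = [t for _, _, t in stratum]
        # If comparison i was answered faster than comparison j, treat it as the
        # stronger preference and encourage a larger predicted gap for i.
        for i in range(len(stratum)):
            for j in range(len(stratum)):
                if times[i] < times[j]:
                    losses.append(F.softplus(gaps[j] - gaps[i]))
    if not losses:
        return torch.tensor(0.0)
    return torch.stack(losses).mean()
```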
Our contributions are threefold:
(1) ResponseRank, a novel method that robustly learns preference strength by leveraging locally valid relative strength signals;
(2) empirical evidence of improved sample efficiency and robustness across diverse tasks: synthetic preference learning (with simulated response times), language modeling (with annotator agreement), and RL control tasks (with simulated episode returns); and
(3) the *Pearson Distance Correlation (PDC)*, a novel metric that isolates cardinal utility learning from ordinal accuracy.
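As an illustration of how such a metric could separate cardinal from ordinal quality, the sketch below assumes PDC is the Pearson correlation between predicted and ground-truth utility differences over evaluation pairs; the function name and exact definition are assumptions, not taken from the paper.

```python
# Hypothetical reading of a Pearson-Distance-Correlation-style metric: the
# Pearson correlation between predicted and ground-truth utility gaps. A model
# can rank every pair correctly (perfect ordinal accuracy) yet score poorly
# here if its gap magnitudes are miscalibrated.
import numpy as np

def pearson_distance_correlation(pred_utils, true_utils, pairs):
    """pred_utils, true_utils: 1-D arrays of utilities indexed by item id;
    pairs: iterable of (i, j) index pairs to evaluate on."""
    pred_gaps = np.array([pred_utils[i] - pred_utils[j] for i, j in pairs])
    true_gaps = np.array([true_utils[i] - true_utils[j] for i, j in pairs])
    return np.corrcoef(pred_gaps, true_gaps)[0, 1]
```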
Supplementary Material: zip
Primary Area: General machine learning (supervised, unsupervised, online, active, etc.)
Submission Number: 23475