Keywords: expertise, rlhf, reward learning, preference learning
TL;DR: We propose a way to learn context-dependent annotator expertise to improve reward learning from multiple annotators with varying expertise levels.
Abstract: Reinforcement learning from human feedback (RLHF) has been used successfully to teach robots tasks that are difficult to specify procedurally.
However, feedback from human annotators can be suboptimal and noisy, decreasing accuracy and leading to potentially unsafe behavior.
Furthermore, different human annotators may have varying context-dependent expertise.
In this work, we study the feasibility of learning annotator expertise jointly with a reward model based on annotator feedback.
Unlike prior work, which assumes that human annotators are perfect or that their expertise levels are known, our method performs RLHF training without these assumptions by estimating each annotator's expertise from annotator identity information in the data.
We show that if annotators exhibit varying degrees of expertise, estimating annotator expertise improves the ranking accuracy of the learned reward functions.
When the annotator's expertise depends on the *context*, our method shows limited success.
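To make the joint-estimation idea concrete, below is a minimal illustrative sketch (not the authors' implementation) of a Bradley-Terry preference loss with a learnable per-annotator rationality coefficient that is trained jointly with the reward model; the class name `RewardModel`, the `log_beta` parameter, and all hyperparameters are assumptions for this example.

```python
# Illustrative sketch only: joint learning of a reward model and a
# per-annotator rationality (inverse-temperature) coefficient under a
# Bradley-Terry preference model. Not the paper's actual method.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, obs_dim: int, num_annotators: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )
        # One rationality parameter per annotator; a lower value models
        # noisier (less expert) feedback, a higher value more reliable feedback.
        self.log_beta = nn.Parameter(torch.zeros(num_annotators))

    def segment_return(self, segment: torch.Tensor) -> torch.Tensor:
        # segment: (T, obs_dim) -> scalar sum of per-step predicted rewards
        return self.net(segment).sum()

    def preference_loss(self, seg_a, seg_b, label, annotator_id) -> torch.Tensor:
        # label = 1 if the annotator preferred segment A over segment B, else 0
        beta = self.log_beta[annotator_id].exp()
        logit = beta * (self.segment_return(seg_a) - self.segment_return(seg_b))
        return nn.functional.binary_cross_entropy_with_logits(
            logit.unsqueeze(0), torch.tensor([float(label)])
        )
```

Minimizing this loss over labeled segment pairs updates the reward network and the per-annotator coefficients together, so low-expertise annotators contribute flatter (less confident) preference likelihoods; a context-dependent variant would condition the coefficient on the segment contents rather than on annotator identity alone.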
Submission Number: 6