Abstract: Reward learning techniques enable machine learning systems to learn objectives from human
feedback. A core limitation of these systems is that they assume all feedback comes from
a single human teacher, even though feedback is typically gathered from large and heterogeneous populations.
We propose the Hidden Utility Bandit (HUB) framework to model differences in teacher
rationality, expertise, and costliness, formalizing the problem of learning from multiple
teachers. We develop a variety of solution algorithms and apply them to two real-world
domains: paper recommendation systems and COVID-19 vaccine testing. We find that Active
Teacher Selection (ATS) algorithms outperform baselines by actively selecting when and which
teacher to query. Our key contributions are 1) the HUB framework: a novel mathematical
framework for modeling the teacher selection problem, 2) ATS: an active-learning-based
algorithmic approach that demonstrates the utility of modeling teacher heterogeneity, and
3) proof-of-concept application of the HUB framework and ATS approaches to model and
solve multiple real-world problems with complex trade-offs between reward learning and
optimization.
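The abstract does not spell out the ATS algorithm itself (the full paper formalizes HUB and its solvers), so the following is only a minimal illustrative sketch of the core idea: a learner with a query budget chooses among teachers that differ in rationality (modeled here as a Boltzmann temperature) and per-query cost, then aggregates their noisy preference comparisons. All names, the `beta / cost` selection heuristic, and the win-counting estimator are hypothetical simplifications, not the paper's method.

```python
import math
import random

random.seed(0)

# Hypothetical setup: three items with hidden utilities, unknown to the learner.
HIDDEN_UTILITY = {"a": 1.0, "b": 0.2, "c": 0.6}
ITEMS = list(HIDDEN_UTILITY)


class Teacher:
    """A teacher with Boltzmann rationality `beta` and a per-query `cost`.

    Higher beta means the teacher's preference comparisons are more
    reliable; cost is deducted from the learner's query budget.
    """

    def __init__(self, beta, cost):
        self.beta, self.cost = beta, cost

    def compare(self, x, y):
        # Boltzmann-rational choice:
        # P(pick x) = exp(beta*u(x)) / (exp(beta*u(x)) + exp(beta*u(y)))
        ex = math.exp(self.beta * HIDDEN_UTILITY[x])
        ey = math.exp(self.beta * HIDDEN_UTILITY[y])
        return x if random.random() < ex / (ex + ey) else y


def choose_teacher(teachers, budget):
    # Naive "active selection" heuristic (an assumption, not the paper's
    # policy): pick the affordable teacher with the best rationality-per-cost.
    affordable = [t for t in teachers if t.cost <= budget]
    return max(affordable, key=lambda t: t.beta / t.cost) if affordable else None


def learn_best_item(teachers, budget):
    # Estimate utilities by counting comparison wins until the budget runs out.
    wins = {item: 0 for item in ITEMS}
    while True:
        teacher = choose_teacher(teachers, budget)
        if teacher is None:
            break
        budget -= teacher.cost
        x, y = random.sample(ITEMS, 2)
        wins[teacher.compare(x, y)] += 1
    return max(wins, key=wins.get)


# Usage: an expensive-but-rational teacher vs. a cheap-but-noisy one.
teachers = [Teacher(beta=5.0, cost=2.0), Teacher(beta=0.5, cost=1.0)]
print(learn_best_item(teachers, budget=200.0))
```

With a sufficient budget and a sufficiently rational teacher, the win counts concentrate on the highest-utility item; the real framework instead treats teacher selection as a sequential decision problem that trades off query cost against information gained about the hidden utilities.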
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Junpei_Komiyama1
Submission Number: 6858