Abstract: As AI systems advance beyond human capabilities, scalable oversight becomes critical: how can we supervise AI that exceeds our abilities? A key challenge is that human evaluators may form incorrect beliefs about AI behavior in complex tasks, leading to unreliable feedback for learning human preferences. We propose modeling human evaluators’ beliefs about AI systems to better interpret their feedback and infer their underlying values. We formalize human belief models, analyze their theoretical role in value inference, and characterize when ambiguity remains in this inference. To reduce reliance on exact belief models, we introduce “human belief model covering” as a relaxation and make a preliminary proposal to use foundation models to construct such covering models. Our work demonstrates that modeling human beliefs about AI behavior can improve value inference from human feedback, and it outlines practical research directions for implementing this approach to scalable oversight.
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: We:
- Rewrote the abstract to improve clarity;
- Removed the notation for the span;
- Added citations on prior belief modeling approaches to the introduction;
- Made various small changes to clarify that our work can be understood independently of Lang et al. (2024);
- Added sentences at the start of Section 2.2 to explain why we work with MDPs;
- Added Remark 2.2 to clarify the linearity assumption;
- Added Remark 2.3 to clarify our relationship to other work involving partial observations;
- Added Section 2.5 to clarify the relevance of our problem setting to scalable oversight;
- Added elaborations to Section 3.2 to explain morphisms and how ontology translations appear in the theory;
- Added Section 3.4.4 on the limitations of the practical proposal;
- Added a citation on IRL in partially observable environments to Section 4.2;
- Rewrote the conclusion.
Assigned Action Editor: ~Branislav_Kveton1
Submission Number: 4383