Modeling Human Beliefs about AI Behavior for Scalable Oversight

Published: 18 Aug 2025, Last Modified: 18 Aug 2025. Accepted by TMLR. License: CC BY 4.0
Abstract: As AI systems advance beyond human capabilities, scalable oversight becomes critical: how can we supervise AI that exceeds our abilities? A key challenge is that human evaluators may form incorrect beliefs about AI behavior in complex tasks, leading to unreliable feedback and poor value inference. To address this, we propose modeling evaluators' beliefs to interpret their feedback more reliably. We formalize human belief models, analyze their theoretical role in value learning, and characterize when ambiguity remains. To reduce reliance on precise belief models, we introduce "belief model covering" as a relaxation. This motivates our preliminary proposal to use the internal representations of adapted foundation models to mimic human evaluators' beliefs. These representations could be used to learn correct values from human feedback even when evaluators misunderstand the AI's behavior. Our work suggests that modeling human beliefs can improve value learning and outlines practical research directions for implementing this approach to scalable oversight.
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Branislav_Kveton1
Submission Number: 4383