Modeling Human Beliefs about AI Behavior for Scalable Oversight

TMLR Paper 4383 Authors

28 Feb 2025 (modified: 18 Mar 2025) · Under review for TMLR · CC BY 4.0
Abstract: Contemporary work in AI alignment often relies on human feedback to teach AI systems human preferences and values. Yet as AI systems grow more capable, human feedback becomes increasingly unreliable. This raises the problem of scalable oversight: How can we supervise AI systems that exceed human capabilities? In this work, we propose to model the human evaluator's beliefs about the AI system's behavior in order to better interpret the evaluator's feedback. We formalize human belief models and theoretically analyze their role in inferring human values. We then characterize the ambiguity that remains in this inference and the conditions under which it disappears. To mitigate reliance on exact belief models, we introduce the relaxation of human belief model covering. Finally, we propose using foundation models to construct covering belief models, providing a new potential approach to scalable oversight.
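
To make the abstract's central idea concrete, here is a minimal sketch, assuming a reward-learning setup; the notation is illustrative and not taken from the paper. Let $\pi$ denote the AI's policy, let $B(\pi)$ denote the human evaluator's belief about the behavior of $\pi$ (a distribution over trajectories $\tau$ the evaluator thinks $\pi$ produces), and let $u_H$ denote the human's utility function. Feedback is then generated through the belief model rather than through the true behavior:

$$ f(\pi) \;=\; \mathbb{E}_{\tau \sim B(\pi)}\big[\, u_H(\tau) \,\big]. $$

Under this sketch, inferring $u_H$ from observed feedback $f$ requires knowing (or at least covering) the belief model $B$; when $B$ is unknown or misspecified, multiple utility functions can explain the same feedback, which is the kind of ambiguity the abstract refers to.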
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Branislav_Kveton1
Submission Number: 4383