A Benchmark for Scalable Oversight Mechanisms

ICLR 2025 Workshop BuildingTrust Submission 110 Authors

11 Feb 2025 (modified: 06 Mar 2025) Submitted to BuildingTrust, CC BY 4.0
Track: Long Paper Track (up to 9 pages)
Keywords: scalable oversight, debate, alignment
TL;DR: We built a benchmark for scalable oversight/human feedback mechanisms
Abstract: As AI agents surpass human capabilities, scalable oversight -- the problem of effectively supplying human feedback to potentially superhuman AI models -- becomes increasingly critical to ensure alignment. While numerous scalable oversight mechanisms have been proposed, there is no systematic empirical framework for evaluating and comparing them. Recent works have attempted to empirically study scalable oversight mechanisms -- particularly Debate -- but we argue that they contain methodological flaws that limit their usefulness to AI alignment. We introduce the scalable oversight benchmark, a principled framework for evaluating human feedback mechanisms based on our agent score difference (ASD) metric, a measure of how effectively a mechanism advantages truth-telling over deception. We supply a Python package to facilitate rapid and competitive evaluation of scalable oversight protocols on our benchmark, and conduct a demonstrative experiment benchmarking Debate.
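For illustration only, a minimal sketch of how an agent score difference might be computed, assuming ASD is the mean score a mechanism's judge awards a truth-telling agent minus the mean score it awards a deceptive agent; the function and variable names below are hypothetical and do not reflect the API of the accompanying Python package.

```python
from statistics import mean

def agent_score_difference(truthful_scores, deceptive_scores):
    """Hypothetical ASD sketch: mean judge-assigned score of the truth-telling
    agent minus that of the deceptive agent under a given oversight mechanism.
    Positive values indicate the mechanism advantages truth-telling."""
    return mean(truthful_scores) - mean(deceptive_scores)

# Example: judge scores collected over several rounds of a mechanism (e.g. Debate).
truthful_scores = [0.9, 0.8, 0.85]   # scores when the agent argues for the true answer
deceptive_scores = [0.4, 0.5, 0.45]  # scores when the agent argues for a false answer
print(agent_score_difference(truthful_scores, deceptive_scores))  # ~0.4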
Submission Number: 110