Abstract: Debate is a scalable oversight method in which two copies of a strong model are trained to defend alternative responses to a question, with a weaker judge evaluating which answer is better supported. We replicate and extend a result from prior work demonstrating that training Llama3-8B-Instruct-262k as a debater increased the performance of a GPT-4-class judge model on QuALITY, a question-answering task that grants the debaters a capability advantage via information asymmetry. When replicating the original setup as closely as possible, we confirm that training debater models in free-form, multi-round debate can increase judge accuracy. However, this finding did not generalize to alternative tasks or models, and did not replicate consistently even under our closest approximation to the original setting. These results suggest that the effectiveness of free-text debate as a scalable oversight method is sensitive to task structure, model pairing, and training conditions, and they highlight the need for a better understanding of when and why debate improves judge accuracy. We identify several factors that may influence debate's success and outline directions for future work aimed at characterizing the conditions under which debate reliably strengthens oversight.
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Masashi_Sugiyama1
Submission Number: 8575