Mitigating Legibility Tax with Decoupled Prover-Verifier Games

Published: 02 Mar 2026, Last Modified: 03 Mar 2026 | ICLR 2026 Trustworthy AI | CC BY 4.0
Keywords: Prover-Verifier Games, Scalable Oversight
TL;DR: We eliminate the legibility tax in prover–verifier games by decoupling correctness from checkability, training a solver for accuracy and a separate translator that rewrites the solver’s solution into a checkable proof.
Abstract: As large language models become increasingly capable, it is critical that their outputs can be easily checked by less capable systems. Prover-verifier games can be used to improve the checkability of model outputs, but they display a degradation in accuracy compared to a baseline trained only to maximize correctness, a phenomenon called *legibility tax* (Kirchner et al., 2024). We propose a solution by decoupling the correctness condition from the checkability condition and instead training a "translator" model that turns a fixed solver model's solution into a checkable form. This allows us to first train the solver to maximize correctness, and then train the translator to rewrite the solver's solution into a checkable form while retaining the solver's answer. To accommodate this new objective of translation, we formulate a *decoupled prover-verifier game* where the equilibria correspond to faithful and checkable translators.
Submission Number: 286