Keywords: Online Learning Theory, Theory of LLMs, Chain-of-Thought Reasoning
Abstract: Large Language Models (LLMs) with chain-of-thought generation have demonstrated great potential for solving complex reasoning and planning tasks, but their output is not fully reliable and requires careful verification. Learned verifiers can help increase trust, enforce safety constraints, and ensure alignment with personal preferences, yet training them is challenging because interactions between generator and verifier may induce substantial distribution shift. Motivated by this challenge, we propose a framework for online learning of chain-of-thought verifiers that, given a problem and a sequence of reasoning steps, check the correctness of the solution. Highlighting the asymmetric roles of soundness errors (failing to flag incorrect reasoning) and completeness errors (rejecting correct reasoning), we introduce novel extensions of the Littlestone dimension that tightly characterize mistake bounds in the realizable setting. We provide optimal algorithms for finding the Pareto frontier (the smallest total number of mistakes given a budget of soundness mistakes) as well as for minimizing a linear combination of mistake costs. We further show how our learned verifiers can be used to boost the accuracy of a collection of weak generators: under the mild assumption that one of the generators produces the next reasoning step correctly with some minimal probability, we show how to learn a strong generator with small error and abstention rates.
Submission Number: 216