Keywords: Large Language Models, Self-Improvement, Reinforcement Learning
TL;DR: Scaling train- and test-time compute using self-improvement
Abstract: Reasoning training of LLMs teaches them to produce long chains of thought (long CoT), with attendant increases in accuracy (good!), context length (bad!), compute cost (bad!), and answer latency (bad!). Can current models provide other combinations on this Pareto frontier, e.g., give answers with better accuracy than long CoT models, despite lower context length and/or latency? We show that the answer is ``yes'' and also give ways to systematically think about these design choices. Abstractly, this involves viewing the model as an \emph{improvement operator} with a continuum of strategies for improving its problem solving (for example: generate four shorter answers and combine their good points into a single superior answer). We study an inference method \textbf{Parallel-Distill-Refine (PDR)} that performs a few rounds of the following: (i) generate diverse drafts in parallel; (ii) \emph{distill} them into a bounded, textual \emph{workspace}; and (iii) \emph{refine} conditioned on this workspace, which then seeds the next round. PDR often provides better performance than long CoT, with lower latency and smaller context size. An interesting subcase of PDR is \textbf{Sequential Refinement (SR)}, which iteratively improves a single candidate answer without a persistent workspace. It provides performance superior to long CoT, with the benefit of compact context size but high latency. These examples suggest training interventions to shift the Pareto frontier. For example, we use RL to train an 8B model to better align with PDR as the inference method, which further improves performance. On math tasks with rule-based checkers, iterative pipelines surpass single-pass baselines at matched sequential budget; shallow PDR delivers the largest gains (e.g., +10\% on AIME 2024 and +11\% on AIME 2025).
Primary Area: foundation or frontier models, including LLMs
Submission Number: 7948
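
The abstract describes PDR as a three-step loop (parallel drafts, distillation into a bounded workspace, refinement that seeds the next round). Below is a minimal sketch of that loop, assuming an `llm(prompt) -> str` callable; the prompt templates and parameter names (`num_drafts`, `num_rounds`) are illustrative assumptions, not the authors' implementation, and drafts are generated sequentially here for clarity although the method calls for parallel generation.

```python
def pdr(problem: str, llm, num_drafts: int = 4, num_rounds: int = 2) -> str:
    """Sketch of the Parallel-Distill-Refine (PDR) inference loop."""
    workspace = ""  # bounded textual workspace carried between rounds
    answer = ""
    for _ in range(num_rounds):
        # (i) generate diverse drafts (in parallel in the actual method)
        drafts = [
            llm(f"Problem: {problem}\nWorkspace: {workspace}\nPropose a solution.")
            for _ in range(num_drafts)
        ]
        # (ii) distill the drafts into a bounded textual workspace
        workspace = llm(
            "Distill the correct and useful points of these attempts into "
            "a short workspace:\n" + "\n---\n".join(drafts)
        )
        # (iii) refine: produce an answer conditioned on the workspace,
        # which then seeds the next round
        answer = llm(
            f"Problem: {problem}\nWorkspace: {workspace}\n"
            "Write a refined final answer."
        )
    return answer
```

Under this reading, Sequential Refinement (SR) is the subcase with a single draft per round and no distillation step: each round refines the previous answer directly, which keeps the context compact but makes the latency fully sequential.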