Learning to Correct: Reinforcement Learning for Multi-Attempt Chain-of-Thought

Published: 30 Apr 2026, Last Modified: 24 Jun 2026 | ICML 2026 regular | CC BY 4.0
TL;DR: We propose CAL, a calibrated attempt-level weighting for GRPO that gives unbiased, lower-variance gradients for multi-attempt CoT with hard verifier feedback, improving Verification@K over trajectory-level and naive attempt-level baselines.
Abstract: State-of-the-art reasoning models use long chain-of-thought (CoT) to solve increasingly complex problems with more test-time computation. In this work, we explore a long CoT setting where the model makes up to K successive attempts at solving a problem, and each attempt may build on earlier ones after the model receives hard verifier feedback. This motivates RL methods that can harness per-attempt rewards by carefully weighting individual attempts. We study optimizing the Verification@K reward (the model succeeds by the K-th attempt) and show that naively weighting the attempts by their pass/fail outcomes yields biased gradients. We introduce Calibrated Attempt-Level (CAL) GRPO, which uses a weighting strategy designed to obtain unbiased gradients while maintaining small variance. Our theory reveals how incorporating per-attempt rewards influences training and the eventual Verification@K performance. Experiments, baselines, and ablations on synthetic and real data corroborate our theory and the benefits of CAL-GRPO over vanilla GRPO as well as naive weighting.
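As a minimal illustration of the Verification@K reward described in the abstract (the model succeeds by the K-th attempt), the sketch below computes it from per-attempt pass/fail outcomes. The function names and the list-of-booleans trajectory representation are assumptions made for illustration; they are not taken from the paper or its code repository, and the sketch does not implement CAL weighting.

```python
from typing import Sequence


def verification_at_k(attempt_passed: Sequence[bool], k: int) -> int:
    """Return 1 if any of the first k attempts passed the verifier, else 0.

    `attempt_passed` is a hypothetical representation: one boolean per
    attempt, in the order the attempts were made.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return int(any(attempt_passed[:k]))


def mean_verification_at_k(trajectories: Sequence[Sequence[bool]], k: int) -> float:
    """Average Verification@K over a batch of multi-attempt trajectories."""
    if not trajectories:
        raise ValueError("need at least one trajectory")
    return sum(verification_at_k(t, k) for t in trajectories) / len(trajectories)


# Three illustrative trajectories: success on attempt 2, no success, success on attempt 1.
example_batch = [[False, True], [False, False, False], [True]]
score_k1 = mean_verification_at_k(example_batch, k=1)
score_k3 = mean_verification_at_k(example_batch, k=3)
```

In this toy batch only the third trajectory succeeds on its first attempt, while two of the three succeed within three attempts, so the score can only stay the same or rise as K increases.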
Lay Summary: Modern AI systems often solve hard problems by trying more than once: they make an attempt, receive feedback about whether it was correct, and then try again if needed. This paper studies how to train AI models to use these repeated attempts more effectively. We propose a calibrated training method that gives feedback to individual attempts while accounting for how much each attempt contributes to eventually solving the problem. This helps the model balance two goals: answering correctly earlier and improving after receiving failure feedback. We provide theoretical analysis showing why this calibration gives a better training signal, and we test the method on math problems, maze navigation, and controlled planning tasks. Across these settings, the calibrated method improves the model’s ability to solve problems within multiple attempts compared to standard and naive training approaches.
Originally Submitted Supplementary Material: zip
Link To Code: https://github.com/alperengozeten/learning-to-correct
Primary Area: Theory->Reinforcement Learning and Planning
Keywords: Reinforcement Learning, GRPO, Verification, Multi-Attempt Reasoning
Originally Submitted PDF: pdf
Submission Number: 28252