Unrolled Policy Iteration for Tiny Recursive Models

Published: 05 Mar 2026, Last Modified: 11 Mar 2026
Venue: ICLR 2026 Workshop RSI Poster
Readers: Everyone
License: CC BY 4.0
Keywords: Recursive Self-Improvement, Tiny Recursive Models, Reinforcement Learning, Verifiable Rewards, Policy Iteration, Plan-Space MDP, Latent Reasoning, Constraint Satisfaction, Conservative Policy Improvement
TL;DR: We enable recursive self-improvement from checker feedback by formalizing Tiny Recursive Model training as a plan-space MDP with stability bounds for truncated internal evaluation.
Abstract: We study recursive self-improvement via internal evaluators trained from verifier feedback, focusing on stability when recursive compute is scaled at test time. In plan-editing models such as Tiny Recursive Models (TRMs), an inner evaluator is unrolled for $n$ recurrent steps to produce a value estimate; the unroll depth $n$ is an architectural compute budget. We formalize plan editing as a Markov decision process and analyze approximate policy iteration with truncated internal evaluation. Under a contraction assumption on the inner recursion (the latent update map must have Lipschitz constant $L_z < 1$), we decompose value error into an architectural Bellman-residual term plus a finite-unrolling bias that decays geometrically as $L_z^{n}$ with depth. For conservative mixture updates that blend the current policy with a candidate at mixing weight $\alpha$, we show that with statewise-centered advantage estimates, evaluation error enters the improvement bound scaled by $\alpha$ rather than at full strength, reducing sensitivity to imperfect evaluation. Experiments on Sudoku ($4 \times 4$ and $9 \times 9$) trained solely from checker feedback validate feasibility and show that contraction strength and latent projection modulate stability under depth mismatch between training and evaluation. The analysis identifies the contraction modulus and unroll depth as practical stability controls for policy improvement under truncated internal computation.
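The geometric decay of the finite-unrolling bias can be illustrated with a toy sketch. This is not the paper's model: `latent_update` is a hypothetical scalar map chosen only because it is $L_z$-Lipschitz in $z$, and `mix_policies` assumes the conservative update takes the common form $(1-\alpha)\pi + \alpha\pi'$, which the abstract does not spell out. Under those assumptions, the gap between an $n$-step unroll and the fixed point stays below $L_z^{n}$ times the initial gap.

```python
import math


def latent_update(z, x, L_z=0.5):
    """Hypothetical latent update. tanh is 1-Lipschitz, so this map is
    L_z-Lipschitz in z and a contraction whenever L_z < 1."""
    return L_z * math.tanh(z) + x


def unroll(z0, x, n, L_z=0.5):
    """Apply the latent update n times starting from z0 (truncated unroll)."""
    z = z0
    for _ in range(n):
        z = latent_update(z, x, L_z)
    return z


def mix_policies(pi, pi_candidate, alpha):
    """Assumed conservative mixture: (1 - alpha) * pi + alpha * pi_candidate."""
    return [(1 - alpha) * p + alpha * q for p, q in zip(pi, pi_candidate)]


if __name__ == "__main__":
    L_z, x, z0 = 0.5, 0.3, 2.0
    # A very deep unroll stands in for the fixed point of the contraction.
    z_star = unroll(z0, x, 200, L_z)
    for n in (1, 2, 4, 8, 16):
        gap = abs(unroll(z0, x, n, L_z) - z_star)
        bound = (L_z ** n) * abs(z0 - z_star)
        print(f"n={n:2d}  gap={gap:.3e}  bound={bound:.3e}")
    # A small alpha keeps the mixed policy close to the current one.
    print(mix_policies([0.7, 0.3], [0.1, 0.9], alpha=0.1))
```

In this toy setting, raising `n` or lowering `L_z` tightens the bound, which mirrors the abstract's point that unroll depth and contraction modulus act as stability controls.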
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 45