Emergent Reasoning via Recursive Latent Reinforcement Pretraining

Published: 05 Mar 2026, Last Modified: 05 Mar 2026 · ICLR 2026 Workshop on LLM Reasoning · CC BY 4.0
Track: long paper (up to 10 pages)
Keywords: pretraining, RL, deep supervision, reasoning
Abstract: Large language models (LLMs) often rely on explicit chain-of-thought (CoT) traces to solve multi-step reasoning problems, but these traces increase inference cost, expose brittle prompt dependence, and complicate training objectives. We study an alternative: \emph{latent deliberation} implemented as a small recurrent refinement module that performs multiple internal ``thinking'' steps while keeping the external sequence length fixed. We introduce \textbf{Recursive Latent Reinforcement Pretraining (RLRP)}, a training recipe that augments a base causal LLM with a shared latent head executed for $K$ refinement steps on \emph{every token}. The head updates a latent state via bounded residual iterations and projects it back to the hidden space to produce step-wise logits. Training combines (i) deep supervision with a convex combination of per-step next-token cross-entropies, (ii) data-aware routing that interleaves reasoning-focused and fluency-focused batches, and (iii) soft reinforcement learning on reasoning batches that maximizes the model's probability mass on the ground-truth next token, optionally restricted to answer spans. We additionally consider an ``improvement penalty'' that encourages later refinement steps to outperform the first step. Our approach is simple, compatible with standard autoregressive LMs and distributed training, and focuses on iterative latent refinement without increasing output tokens.
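The refinement head and the deep-supervision objective sketched in the abstract can be illustrated with a toy NumPy example. This is a minimal sketch under stated assumptions, not the paper's implementation: the parameter names (`W_in`, `W_rec`, `W_out`), shapes, and the specific residual update are illustrative, and only one token's hidden state is processed.

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab, K = 8, 16, 3        # hidden size, vocab size, refinement steps (toy values)

# Hypothetical parameters of the shared latent head (names are illustrative).
W_in  = rng.normal(scale=0.1, size=(d, d))      # hidden state -> initial latent
W_rec = rng.normal(scale=0.1, size=(d, d))      # latent -> residual update
W_out = rng.normal(scale=0.1, size=(d, vocab))  # latent -> step-wise logits

def refine(h):
    """Run K bounded residual refinement steps on one token's hidden state h;
    returns per-step logits of shape (K, vocab)."""
    z = np.tanh(h @ W_in)
    logits = []
    for _ in range(K):
        z = z + np.tanh(z @ W_rec)   # bounded residual iteration (tanh keeps each update in [-1, 1])
        logits.append(z @ W_out)     # project the latent back to vocabulary logits
    return np.stack(logits)

def cross_entropy(logit, target):
    """Numerically stable next-token cross-entropy for one step's logits."""
    m = logit.max()
    logp = logit - m - np.log(np.exp(logit - m).sum())
    return -logp[target]

h = rng.normal(size=d)               # stand-in for a base-LM hidden state
step_logits = refine(h)
step_ce = [cross_entropy(l, target=3) for l in step_logits]

alphas = [0.2, 0.3, 0.5]             # convex weights over the K steps (sum to 1)
loss = sum(a * ce for a, ce in zip(alphas, step_ce))

# "Improvement penalty": encourage the last refinement step to beat the first.
penalty = max(0.0, step_ce[-1] - step_ce[0])
```

Because the deep-supervision loss is a convex combination of per-step cross-entropies, every refinement step receives a direct gradient signal, while the penalty term only activates when the final step is worse than the first.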
Presenter: ~Istabrak_Abbes1
Format: Yes, the presenting author will attend in person if this work is accepted to the workshop.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Funding: Yes, the presenting author of this submission falls under ICLR’s funding aims, and funding would significantly impact their ability to attend the workshop in person.
Submission Number: 85