Learning Efficient Latent Reasoning with Abstract Chain-of-Thought
Track: long paper (up to 10 pages)
Keywords: latent reasoning, reinforcement learning, abstract vocabulary, chain-of-thought, efficient reasoning
TL;DR: We train language models to reason using a reserved vocabulary through an iterative warm-up and reinforcement learning, achieving inference-time efficiency gains of up to 12x fewer tokens.
Abstract: While long, explicit chains-of-thought (CoT) have proven effective on complex reasoning tasks, they are costly to generate at inference time. Recent latent reasoning methods leverage continuous representations, yet have primarily targeted pre-training and have not demonstrated efficacy at scale, lagging behind verbalized CoT. We propose $\textbf{Abstract Chain-of-Thought}$, a discrete latent reasoning post-training mechanism in which the language model produces a short sequence of tokens from a reserved codebook in lieu of a natural language CoT, prior to generating a response. To make previously unseen “abstract” tokens useful, we introduce a policy iteration-style warm-up loop that alternates between (i) bottlenecking a verbal CoT via masking and performing supervised fine-tuning, and (ii) self-distillation, training the model to generate abstract tokens from the prompt alone via decoding constrained to the codebook. After warm-up, we optimize the generation of abstract sequences with warm-started reinforcement learning under constrained decoding. For the Qwen3-8B and Granite-3.3-8B language models, Abstract-CoT uses up to $12\times$ fewer reasoning tokens while maintaining comparable performance across mathematical reasoning, instruction-following, and multi-hop reasoning in natural language. We also find emergent non-uniform usage over the codebook, resembling power-law dynamics seen in pre-training. Our findings highlight the potential of post-training recipes that facilitate latent reasoning for inference efficiency while inducing new representational dynamics through the introduction of a new “reasoning language”.
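The constrained decoding described in the abstract — restricting sampling to a reserved codebook of abstract tokens — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the vocabulary size, codebook size, and function names are all hypothetical.

```python
import math
import random

# Hypothetical toy setup: the last 8 ids of a 32-token vocabulary are
# reserved as the "abstract" codebook (sizes are illustrative only).
VOCAB_SIZE = 32
CODEBOOK = set(range(VOCAB_SIZE - 8, VOCAB_SIZE))

def constrained_sample(logits, allowed, rng):
    """Sample a token id after masking all ids outside `allowed` to -inf."""
    masked = [l if i in allowed else float("-inf") for i, l in enumerate(logits)]
    m = max(masked)
    exps = [math.exp(l - m) for l in masked]  # exp(-inf) underflows to 0.0
    total = sum(exps)
    probs = [e / total for e in exps]
    # Inverse-CDF sampling over the renormalized distribution.
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i
    return max(allowed)

rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(VOCAB_SIZE)]
tok = constrained_sample(logits, CODEBOOK, rng)
assert tok in CODEBOOK  # the model can only emit abstract tokens here
```

During the warm-up loop and subsequent reinforcement learning, a mask of this kind would be applied at each step of the abstract-sequence generation, so that gradient updates only ever reinforce codebook tokens.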
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Presenter: ~Keshav_Ramji1
Format: Yes, the presenting author will attend in person if this work is accepted to the workshop.
Funding: No, the presenting author of this submission does *not* fall under ICLR’s funding aims, or has sufficient alternate funding.
Submission Number: 86