Continuous Chain of Thought Enables Parallel Exploration and Reasoning

Halil Alperen Gozeten; Muhammed Emrullah Ildiz; Xuechen Zhang; Hrayr Harutyunyan; Ankit Singh Rawat; Samet Oymak

Continuous Chain of Thought Enables Parallel Exploration and Reasoning

Halil Alperen Gozeten, Muhammed Emrullah Ildiz, Xuechen Zhang, Hrayr Harutyunyan, Ankit Singh Rawat, Samet Oymak

Published: 26 Jan 2026, Last Modified: 11 Feb 2026ICLR 2026 PosterEveryoneRevisionsBibTeXCC BY 4.0

Keywords: chain-of-thought, latent space reasoning, parallel exploration, transformers, policy optimization, multi token sampling

TL;DR: We establish theoretical benefits of chain-of-thought with continuous tokens and introduce new supervision and policy optimization strategies.

Abstract: Modern language models generate chain-of-thought traces by autoregressively sampling tokens from a finite vocabulary. While this discrete sampling has achieved remarkable success, conducting chain-of-thought with continuously-valued tokens (CoT2) offers a richer and more expressive alternative. Our work provides new theoretical guarantees and algorithms for CoT2, motivated by logical reasoning tasks that inherently require search capabilities. Theoretically, we establish how CoT2 facilitates the model to track multiple discrete traces in parallel; and quantify the level of achievable parallelism and its benefits for inference efficiency. We also provide a CoT2-based one-layer transformer construction that solves the combinatorial ``subset sum problem'' given a sufficient embedding dimension. These insights arise from a novel and effective supervision strategy where we match the language model outputs to the empirical token distributions of a set of target traces. Complementing this, we introduce sampling strategies that unlock policy optimization methods for CoT2. Our primary strategy samples and composes $K$ discrete tokens at each decoding step to control the level of parallelism. Experiments confirm that (i) the optimal level of parallelism is governed by the embedding dimension, (ii) our continuous supervision strategy can outperform alternative methods, and (iii) policy optimization with CoT2 indeed improves the performance of the model beyond its initial discrete or continuous supervision.

Supplementary Material: zip

Primary Area: foundation or frontier models, including LLMs

Submission Number: 22632

Loading