Continuous Chain of Thought Enables Parallel Exploration and Reasoning

Published: 10 Jun 2025, Last Modified: 15 Jul 2025 · MOSS@ICML2025 · CC BY 4.0
Keywords: chain-of-thought, latent space reasoning, parallel exploration, transformers, policy optimization, multi-token sampling
TL;DR: This paper establishes theoretical benefits of chain-of-thought with continuous tokens (CoT2) and introduces new supervision and policy optimization strategies to train CoT2 models.
Abstract: We propose CoT2, a framework using continuously-valued tokens that enables language models to track multiple reasoning paths in parallel. We also provide a novel CoT2 supervision strategy in which the model's softmax outputs are matched to the empirical token distributions of a set of target traces. Theoretically, we show that CoT2 offers sample-complexity benefits, and we construct a one-layer transformer that solves the subset-sum problem given sufficient embedding capacity. We further introduce continuous sampling methods and show that reinforcement learning with CoT2 notably improves logical reasoning performance compared to discrete and continuous baselines.
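The supervision strategy described above can be sketched in a few lines. This is a hypothetical illustration of the idea only, not the paper's implementation: the function names, shapes, and the cross-entropy formulation are assumptions made for clarity.

```python
import numpy as np

def empirical_token_distribution(target_traces, step, vocab_size):
    """Fraction of target traces emitting each token at a given reasoning step.

    target_traces: list of token-id sequences (the set of valid target traces).
    Returns a probability vector over the vocabulary (an assumed representation).
    """
    dist = np.zeros(vocab_size)
    for trace in target_traces:
        dist[trace[step]] += 1.0
    return dist / len(target_traces)

def softmax(logits):
    # Numerically stable softmax over the vocabulary dimension.
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def cot2_step_loss(logits, target_traces, step, vocab_size):
    """Cross-entropy between the model's softmax output at `step` and the
    empirical token distribution of the target traces (illustrative loss)."""
    p = softmax(logits)
    q = empirical_token_distribution(target_traces, step, vocab_size)
    return -np.sum(q * np.log(p + 1e-12))
```

For example, if three of four target traces emit token 1 at step 0, the target distribution at that step is 0.75 on token 1 and 0.25 on token 0, rather than a one-hot label; the model is thus trained to keep multiple reasoning paths alive in a single soft output.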
Code: ipynb
Submission Number: 65