Keywords: Large Language Models, Hybrid Reasoning, Reinforcement Learning
Abstract: Large Language Models (LLMs) have shown strong performance in complex reasoning tasks, especially when guided by Chain-of-Thought (CoT) prompting. However, conventional CoT reasoning in the discrete token space suffers from high computational and memory costs due to verbose intermediate steps. Recent work has explored latent reasoning in the embedding space to improve efficiency, but often at the cost of clarity and performance. In this work, we propose $\underline{Hy}$brid $\underline{Rea}$soning ($\texttt{HyRea}$), a unified framework that enables LLMs to dynamically switch between explicit (token-based) and latent (embedding-based) reasoning during inference. To train the model to make these decisions effectively, we introduce a two-stage training pipeline: (1) a supervised cold-start phase that introduces latent reasoning by replacing low-entropy CoT steps with embeddings, and (2) a reinforcement learning phase using Group Relative Policy Optimization (GRPO) to fine-tune the model’s reasoning strategy based on task-specific rewards.
Experiments on mathematical reasoning benchmarks show that $\texttt{HyRea}$ achieves significant reductions in token usage while maintaining or improving accuracy, offering a practical and scalable route to efficient multi-step reasoning in LLMs.
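For concreteness, here is a minimal sketch of the stage-(1) cold-start replacement described above. The abstract does not specify the details, so the step segmentation, the entropy threshold `tau`, and the use of the last-layer final hidden state as the latent embedding are all illustrative assumptions, written against a HuggingFace-style causal LM interface:

```python
import torch

@torch.no_grad()
def build_hybrid_trace(model, tokenizer, cot_steps, tau=1.5):
    """Hypothetical sketch: swap low-entropy CoT steps for embeddings.

    cot_steps -- list of strings, one per reasoning step.
    tau       -- illustrative entropy threshold (not from the paper).
    """
    hybrid = []
    for step in cot_steps:
        ids = tokenizer(step, return_tensors="pt").input_ids
        out = model(ids, output_hidden_states=True)
        # Mean next-token entropy of the model's predictive
        # distribution over this step's tokens.
        logp = out.logits.log_softmax(dim=-1)
        entropy = -(logp.exp() * logp).sum(dim=-1).mean()
        if entropy < tau:
            # Low entropy: the step is "easy"; keep only the final
            # hidden state as a latent (embedding-based) stand-in.
            hybrid.append(("latent", out.hidden_states[-1][:, -1]))
        else:
            # High entropy: keep the explicit token-based step.
            hybrid.append(("explicit", ids))
    return hybrid
```

The mixed trace of explicit token steps and latent embeddings would then serve as the supervised cold-start target, teaching the model when each mode is appropriate.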
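For reference, stage (2) invokes Group Relative Policy Optimization; its standard group-relative form is reproduced below (the abstract does not give the paper's exact reward design, so this is the common formulation rather than the authors' specific instantiation). For a question $q$, a group of $G$ sampled outputs $\{o_i\}$ with rewards $\{r_i\}$ is scored by the standardized advantage $A_i$:

$$
\mathcal{J}_{\text{GRPO}}(\theta)=\mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}\min\!\left(\frac{\pi_\theta(o_i\mid q)}{\pi_{\theta_{\text{old}}}(o_i\mid q)}\,A_i,\;\operatorname{clip}\!\left(\frac{\pi_\theta(o_i\mid q)}{\pi_{\theta_{\text{old}}}(o_i\mid q)},\,1-\epsilon,\,1+\epsilon\right)A_i\right)\right]-\beta\,\mathbb{D}_{\text{KL}}\!\left(\pi_\theta\,\|\,\pi_{\text{ref}}\right),
\qquad
A_i=\frac{r_i-\operatorname{mean}(\{r_j\}_{j=1}^{G})}{\operatorname{std}(\{r_j\}_{j=1}^{G})}.
$$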
Primary Area: foundation or frontier models, including LLMs
Submission Number: 20575