Compress to Think, Decompress to Speak: Dual-Mode Reasoning in Transformers

ICLR 2026 Conference Submission 22271 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: LLMs, Latent reasoning, inference
TL;DR: Training paradigm for transformers to reason in both latent and language space.
Abstract: Latent reasoning has emerged as an alternative to reasoning in natural language and involves feeding the last layer's hidden state representation (soft token) back as input to the transformer. This idea is promising, since soft tokens have greater expressive capacity than tokens drawn from the vocabulary (hard tokens). However, existing approaches to training transformers with soft tokens often suffer a performance loss and do not allow sampling of different reasoning traces. We propose a training paradigm for transformers that uses soft tokens, in which the model learns to operate in two modes: a latent thinking mode that processes the soft tokens, and a local decoding mode that decompresses the soft tokens into a few reasoning steps expressed with hard tokens from the vocabulary. We focus on logical and math reasoning tasks and fine-tune pretrained models of different sizes. Our method achieves similar or better performance than supervised fine-tuning with chain-of-thought data across all tasks, while requiring a smaller KV cache and allowing different reasoning traces to be sampled at inference.
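The sketch below is only an illustration of the general latent-reasoning loop the abstract describes (feeding the last layer's hidden state back as the next input embedding for a few "latent" steps, then decoding hard tokens), not the authors' training paradigm or code. The choice of GPT-2, the prompt, the number of latent steps, and the direct reuse of the final hidden state without any learned projection are all assumptions made for concreteness.

```python
# Illustrative sketch of soft-token (latent) reasoning followed by hard-token decoding.
# Assumption: the last hidden state is reused directly as the next input embedding;
# in practice a learned projection or normalization may be needed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper fine-tunes pretrained models of different sizes
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

prompt = "Question: 12 + 7 * 3 = ?"
ids = tok(prompt, return_tensors="pt").input_ids
embeds = model.get_input_embeddings()(ids)              # (1, T, d)

num_latent_steps = 4                                     # "latent thinking mode"
with torch.no_grad():
    for _ in range(num_latent_steps):
        out = model(inputs_embeds=embeds, output_hidden_states=True)
        soft_token = out.hidden_states[-1][:, -1:, :]    # last layer, last position
        embeds = torch.cat([embeds, soft_token], dim=1)  # feed the soft token back as input

    # "local decoding mode": decompress into ordinary (hard) tokens, greedily
    for _ in range(16):
        logits = model(inputs_embeds=embeds).logits[:, -1, :]
        next_id = logits.argmax(dim=-1, keepdim=True)    # (1, 1)
        embeds = torch.cat([embeds, model.get_input_embeddings()(next_id)], dim=1)
        print(tok.decode(next_id[0]), end="")
```

Because the latent steps append hidden states rather than generated text, the intermediate "thought" occupies a handful of positions instead of a full chain-of-thought, which is consistent with the abstract's claim of a reduced KV cache at inference.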
Primary Area: foundation or frontier models, including LLMs
Submission Number: 22271