The Information Bottleneck of Chain-of-Thought and How Latent CoT Overcomes It

20 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Chain-of-Thought, Latent CoT, Large language model
Abstract: Chain-of-thought (CoT) has become the de facto paradigm for large language models (LLMs) to solve complex reasoning tasks. However, due to the sequential nature of token generation, inference time can become prohibitive when the CoT is long. This paper identifies a fundamental \emph{information bottleneck} that can cause the CoT to be long: although each forward pass activates a vast number of neurons, the information the model ultimately writes down is limited to a single token, forcing it to produce many more CoT steps than necessary. We first establish this bottleneck theoretically by showing that for some natural problems, such as pointer chasing and computing parity, either 1-layer transformers or constant-layer finite-precision transformers require a long CoT to solve them. We then demonstrate that for these same problems, allowing the transformer to write high-dimensional embeddings to the CoT (i.e., using latent CoT) significantly reduces the CoT length, establishing a provable theoretical benefit of latent CoT. We further validate our theory with controlled experiments: we train a small transformer to simulate Conway’s Game of Life with latent CoT, vary the per-step write bandwidth of the latent CoT, and observe a sharp success threshold proportional to the board size.
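To make the contrast in the abstract concrete, below is a minimal sketch of the two decoding loops it compares: a standard CoT step that collapses the hidden state to a single token before feeding it back, versus a latent CoT step that appends the full hidden vector to the context. All names (`block`, `embed`, `d_model`, etc.) are illustrative placeholders, not the paper's implementation; the sketch assumes PyTorch.

```python
import torch
import torch.nn as nn

# Hypothetical toy setup (names like `block` and `d_model` are illustrative,
# not the paper's code): one transformer layer applied autoregressively.
d_model, vocab = 64, 32
block = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
embed = nn.Embedding(vocab, d_model)
unembed = nn.Linear(d_model, vocab)

def discrete_cot_step(ctx):
    """Standard CoT: the step's rich hidden state is collapsed to ONE token
    (~log2(vocab) bits) before re-entering the context -- the bottleneck."""
    h = block(ctx)[:, -1]                    # (B, d_model) activations
    tok = unembed(h).argmax(dim=-1)          # collapse to a single token id
    return torch.cat([ctx, embed(tok)[:, None]], dim=1)

def latent_cot_step(ctx):
    """Latent CoT: the full d_model-dimensional hidden vector is appended,
    so each step writes d_model floats instead of one token."""
    h = block(ctx)[:, -1]
    return torch.cat([ctx, h[:, None]], dim=1)

with torch.no_grad():
    ctx = embed(torch.randint(0, vocab, (1, 5)))   # dummy 5-token prompt
    for _ in range(3):
        ctx = latent_cot_step(ctx)   # swap in discrete_cot_step to compare
```

Per step, the discrete loop communicates at most log2(vocab) bits to future steps, while the latent loop communicates a full d_model-dimensional vector; this per-step write bandwidth is the quantity the abstract's lower bounds and Game of Life experiment vary.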
Primary Area: foundation or frontier models, including LLMs
Submission Number: 23210