Limits of Continuous Chain-of-Thought in Multi-Step and Multi-Chain Reasoning

Published: 15 Nov 2025, Last Modified: 08 Mar 2026
Venue: AAAI 2026 Bridge (LM Reasoning)
License: CC BY 4.0
Keywords: Large Language Models (LLM), Chain-of-Thought (CoT), Continuous Reasoning, Implicit Reasoning
Abstract: Chain-of-thought (CoT) improves LLM reasoning but can be slow and inefficient because it relies on discrete token generation. Recent implicit (answer-only) and continuous (latent-step) alternatives aim to reduce these costs, but evaluations so far have focused on simple benchmarks. We study how these models perform under two key challenges of compositional reasoning: robustness to increased chain length and the ability to exploit multiple valid reasoning paths. To probe these axes, we introduce a MatMul benchmark with controllable depth and associativity-induced diversity, and a 100k-sample GSM8K corpus with 2-5 distinct chains per question. Across both settings, discrete CoT remains robust, while continuous and implicit models degrade as chains lengthen and fail to leverage reasoning diversity. We trace these failures to curriculum and alignment objectives that provide weak supervision for intermediate reasoning steps and can collapse diverse traces into a single representation. These results point to the need for training objectives that deliver intermediate credit assignment and preserve reasoning diversity in latent models.
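The abstract does not specify how the MatMul benchmark is constructed, but its two controllable axes (chain depth and associativity-induced path diversity) can be illustrated with a minimal, hypothetical sketch: a chain of random matrices over a small modulus, where reducing the chain under different association orders yields distinct valid reasoning paths to the same answer. All names and parameters below (`depth`, `left_assoc`, `right_assoc`, the modulus `p`) are illustrative assumptions, not the paper's actual construction.

```python
import random

def random_matrix(n, p, rng):
    # n x n matrix with entries in Z_p (small modulus keeps answers bounded)
    return [[rng.randrange(p) for _ in range(n)] for _ in range(n)]

def matmul(A, B, p):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) % p
             for j in range(n)] for i in range(n)]

def left_assoc(mats, p):
    # One valid reasoning chain: ((M1 M2) M3) ... Mk
    out = mats[0]
    for M in mats[1:]:
        out = matmul(out, M, p)
    return out

def right_assoc(mats, p):
    # A different valid chain for the same instance: M1 (M2 (... Mk))
    out = mats[-1]
    for M in reversed(mats[:-1]):
        out = matmul(M, out, p)
    return out

rng = random.Random(0)
depth, p = 5, 7                       # "depth" controls chain length
mats = [random_matrix(2, p, rng) for _ in range(depth)]

# Associativity guarantees both chains reach the same final answer,
# giving multiple distinct intermediate-step traces per instance.
assert left_assoc(mats, p) == right_assoc(mats, p)
```

Varying `depth` probes robustness to longer chains, while enumerating different parenthesizations yields the multiple valid reasoning paths the abstract refers to.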
Submission Number: 34