Are Latent Reasoning Models Easily Interpretable?

Published: 02 Mar 2026, Last Modified: 18 Mar 2026
Venue: LIT Workshop @ ICLR 2026
License: CC BY 4.0
Track: long paper (up to 10 pages)
Keywords: latent reasoning, interpretability, mechanistic interpretability
TL;DR: We find that latent reasoning models often don't use their reasoning tokens and, when they do, gold reasoning traces can be decoded from them via vocabulary projection.
Abstract: Latent reasoning models (LRMs) have recently attracted significant research interest due to their lower inference cost compared to explicit chain-of-thought approaches. However, this efficiency comes at the cost of reduced interpretability: LRMs are more difficult for humans to monitor because they do not reason in natural language. This paper presents an initial investigation into LRM interpretability by examining two key questions on the Coconut and CODI models. First, we test the assumption made in prior work that latent reasoning tokens are necessary for model performance. We find that latent reasoning tokens are often unnecessary for LRMs' predictions; on logical reasoning datasets, LRMs can produce the same final answers without using latent reasoning tokens at all. This underutilization of reasoning tokens may partially explain why LRMs do not consistently outperform explicit reasoning methods, and casts doubt on the role that prior work has proposed for these tokens. Second, when latent reasoning tokens are necessary, we investigate whether we can easily decode gold reasoning traces from them as a form of natural language explanation. Using a backtracking method that we propose, we decode gold reasoning traces from latent reasoning tokens 71-93% of the time for correctly predicted instances when operands from the question are included, but only 24-36% of the time for incorrect predictions. This suggests that for correct predictions, LRMs are implementing the expected solution rather than an uninterpretable reasoning process. We find preliminary evidence that incorrect predictions can also be interpreted in this manner, though more robust methods are needed to reliably decode reasoning traces when models do not implement the gold reasoning trace.
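The abstract describes two experiments: ablating latent reasoning tokens to test whether they are load-bearing, and decoding them via vocabulary projection. The sketch below is a minimal, hypothetical illustration of both ideas on a plain GPT-2 backbone; it is not the paper's backtracking method or the Coconut/CODI codebases. The prompt, the latent position indices, and the answer-extraction step are made-up stand-ins for how those models actually produce and consume latent tokens.

```python
# Illustrative sketch only, not the paper's code. The latent positions, prompt,
# and answer extraction below are hypothetical stand-ins.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()


@torch.no_grad()
def answer_with_ablation(prompt: str, latent_positions: list[int]) -> str:
    """Greedy next-token answer with the given input positions zeroed out.

    If the answer matches the unablated run, those positions were not
    load-bearing for the prediction (the paper's first question).
    """
    ids = tok(prompt, return_tensors="pt").input_ids
    embeds = model.transformer.wte(ids)            # (1, seq_len, d_model)
    embeds[0, latent_positions] = 0.0              # ablate the latent slots
    logits = model(inputs_embeds=embeds).logits    # position embeddings added internally
    return tok.decode([int(logits[0, -1].argmax())])


@torch.no_grad()
def project_to_vocab(latent_states: torch.Tensor, top_k: int = 5) -> list[list[str]]:
    """Logit-lens-style vocabulary projection of latent states (n, d_model).

    Applies the final layer norm and the unembedding matrix, then returns
    the top-k nearest token strings per latent position (the paper's
    second question asks whether gold traces surface this way).
    """
    h = model.transformer.ln_f(latent_states)      # final layer norm
    logits = model.lm_head(h)                      # (n, vocab_size)
    return [[tok.decode([t]) for t in row.tolist()]
            for row in logits.topk(top_k, dim=-1).indices]


# Toy usage: ablate two arbitrary positions, then project the penultimate
# layer's hidden states for the last two positions of the same prompt.
print(answer_with_ablation("2 + 3 * 4 =", latent_positions=[2, 3]))
out = model(tok("2 + 3 * 4 =", return_tensors="pt").input_ids,
            output_hidden_states=True)
print(project_to_vocab(out.hidden_states[-2][0][-2:]))
```

In an LRM such as Coconut, the projection would instead be applied to the continuous "thought" vectors that are fed back as input embeddings, rather than to ordinary hidden states as in this toy example.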
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 55