Unlocking the Black Box of Latent Reasoning: An Interpretability-Guided Approach to Intervention

ACL ARR 2026 January Submission1192 Authors

28 Dec 2025 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Language Modeling, Interpretability and Analysis of Models for NLP, Generation
Abstract: Latent reasoning enables Large Language Models (LLMs) to perform multi-step inference within continuous hidden states, offering efficiency gains over explicit Chain-of-Thought (CoT). However, the opacity of these continuous thought vectors hinders their reliability and controllability. This paper bridges the gap between mechanistic interpretability and actionable control. We first present a systematic analysis using structural, causal, and geometric probes, revealing that latent vectors encode compressed, faithful representations of reasoning steps, with early vectors acting as critical causal hubs. Building on these findings, we operationalize the interpretability insights into a suite of training-free, decode-time interventions that refine the latent reasoning process by imposing the identified geometric and semantic priors. Extensive experiments across multiple model scales and diverse task domains demonstrate that our interpretability-guided interventions consistently unlock latent capabilities and improve reasoning accuracy without any parameter updates.
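To make the notion of a "training-free, decode-time intervention on continuous thought vectors" concrete, the following is a minimal illustrative sketch, not the paper's actual method: it assumes a hypothetical geometric prior represented as a low-rank subspace and a function `latent_prior_projection` (both invented here) that nudges each latent vector toward that subspace at decode time.

```python
# Hypothetical sketch only: a training-free, decode-time intervention that blends
# each continuous thought vector with its projection onto a "geometric prior"
# subspace. The prior basis, the function name, and the alpha parameter are
# assumptions for illustration, not the submission's actual algorithm.
import torch

def latent_prior_projection(latent: torch.Tensor,
                            prior_basis: torch.Tensor,
                            alpha: float = 0.5) -> torch.Tensor:
    """Blend a latent vector with its orthogonal projection onto a prior subspace.

    latent:      (d,) continuous thought vector produced at decode time.
    prior_basis: (d, k) orthonormal basis spanning the prior subspace,
                 e.g., estimated offline from earlier latent vectors.
    alpha:       interpolation weight; 0 keeps the original latent,
                 1 replaces it with its projection.
    """
    projected = prior_basis @ (prior_basis.T @ latent)  # orthogonal projection
    return (1 - alpha) * latent + alpha * projected


if __name__ == "__main__":
    torch.manual_seed(0)
    d, k = 64, 8
    # Orthonormal basis via QR of a random matrix (stand-in for an estimated prior).
    basis, _ = torch.linalg.qr(torch.randn(d, k))
    thought = torch.randn(d)          # stand-in for a continuous thought vector
    refined = latent_prior_projection(thought, basis, alpha=0.5)
    print(refined.shape)              # torch.Size([64])
```

In such a scheme, the refined vector would replace the original latent before the next decoding step; no model parameters are updated, which is the sense in which the abstract's interventions are "training-free."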
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Language Modeling, Interpretability and Analysis of Models for NLP, Generation
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Theory
Languages Studied: English
Submission Number: 1192