Where’s the Plan? Locating Latent Planning in Language Models with Lightweight Mechanistic Interventions

08 May 2026 (modified: 09 May 2026) | ICML 2026 Workshop CoLoRAI Submission | CC BY 4.0
Keywords: mechanistic interpretability, interpretability, planning, activation patching, steering
TL;DR: We introduce a lightweight mechanistic framework for pinpointing latent planning sites in language models, showing that future constraints are often encoded but only sometimes causally used.
Abstract: We study $\textit{planning site formation}$ in language models: $\textit{where}$ internal representations of structurally constrained future tokens form during the forward pass, and whether they causally drive generation. Using rhyming-couplet completion as a clean test of forward-looking constraint, we apply two lightweight methods (linear probing and activation patching) across Qwen3, Gemma-3, and Llama-3 at more than ten scales. Probing shows that future-rhyme information is linearly decodable at the line boundary, with a signal that strengthens with scale in all three families. Activation patching reveals that only Gemma-3-27B causally relies on this encoding, exhibiting a $\textit{handoff}$ in which the causal driver migrates from the rhyme word to the line boundary around layer 30. Every other model we test conditions on the rhyme word throughout generation, with near-zero causal effect at the line boundary despite strong probe signal. Using two-stage path patching, we localize the Gemma-3-27B handoff to five attention heads that recover approximately 90% of the rhyme-routing capacity at the newline.
Submission Number: 125
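
As a rough illustration of the activation patching mentioned in the abstract, the sketch below caches hidden states from a "clean" run, overwrites one (layer, position) state in a "corrupted" run, and reports how much of the clean output is recovered. It is a minimal toy under stated assumptions, not the authors' code: the model is a small random NumPy network, and every name in it (`forward`, `patching_effects`, `MIX`, `W_OUT`, the sizes) is hypothetical. Position 0 stands in for the rhyme word, position 1 for the line boundary, and position 2 for the token being predicted.

```python
import numpy as np

rng = np.random.default_rng(0)
N_LAYERS, N_POS, D = 4, 3, 8  # hypothetical toy sizes

# Hypothetical toy model: each layer mixes positions causally, then adds a
# nonlinear update to the residual state. Not a real language model.
W = [rng.normal(scale=0.5, size=(D, D)) for _ in range(N_LAYERS)]
MIX = np.tril(np.ones((N_POS, N_POS)))
MIX = MIX / MIX.sum(axis=1, keepdims=True)  # causal averaging over earlier positions
W_OUT = rng.normal(size=D)                  # readout applied at the last position


def forward(x, patch=None):
    """Run the toy model on x of shape (N_POS, D).

    patch: optional (layer, pos, vector); overwrites the state at `pos`
    right after `layer` has been applied.
    Returns (scalar output logit, list of per-layer states).
    """
    h = x.copy()
    cache = []
    for layer in range(N_LAYERS):
        h = h + np.tanh(MIX @ h @ W[layer])
        if patch is not None and patch[0] == layer:
            h[patch[1]] = patch[2]
        cache.append(h.copy())
    return float(h[-1] @ W_OUT), cache


def patching_effects(clean, corrupt):
    """Normalized logit recovery for every (layer, position) patch.

    0 means the patch changes nothing relative to the corrupted run;
    1 means it fully restores the clean output.
    """
    clean_logit, clean_cache = forward(clean)
    corrupt_logit, _ = forward(corrupt)
    effects = np.zeros((N_LAYERS, N_POS))
    for layer in range(N_LAYERS):
        for pos in range(N_POS):
            patched_logit, _ = forward(
                corrupt, patch=(layer, pos, clean_cache[layer][pos])
            )
            effects[layer, pos] = (patched_logit - corrupt_logit) / (
                clean_logit - corrupt_logit
            )
    return effects


# Clean and corrupted inputs differ only at position 0 (the "rhyme word").
clean = rng.normal(size=(N_POS, D))
corrupt = clean.copy()
corrupt[0] = rng.normal(size=D)

effects = patching_effects(clean, corrupt)
print(np.round(effects, 2))  # rows: layers, columns: positions
```

In a grid like this, a handoff of the kind the abstract describes would appear as the large effects moving from the rhyme-word column in early layers to the line-boundary column in later layers. The random toy model here is not expected to show that pattern; it only demonstrates the measurement.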