When Representations Persist but Control Fails: A Mechanistic Analysis of Search in Language Models

TMLR Paper7016 Authors

14 Jan 2026 (modified: 14 May 2026) · Under review for TMLR · CC BY 4.0
Abstract: Why do language models fail at multi-step reasoning despite encoding task-relevant structure? We investigate this question through graph traversal and report a temporal dissociation: models encode graph-theoretic structure with high fidelity, with Spearman rho = 0.50 to 0.70, yet fail at full-path autonomous multi-step execution on graphs with seven or more nodes. We observe zero successful completions across seven prompting regimes, including few-shot breadth-first and depth-first demonstrations, algorithm-conditioned prompts, structured JSON state updates, self-consistency, and tree-of-thought. The same models produce valid three-step prefixes on 55% to 75% of trials before late-step control collapse. Program-of-thought, the only regime that delegates execution to a Python interpreter, achieves partial success on simple instances. Both observations are consistent with a control-window account rather than absent competence. In 78% of failed trials, internal state drift occurs before the first invalid output. This temporal pattern, together with three classes of causal intervention, provides interventional evidence that control collapse contributes causally to behavioral failure rather than merely accompanying it. Representations persist beyond failure and remain structurally intact even as execution breaks down. When execution is externalized to a symbolic planner, performance recovers to 50% to 100%, and models correctly reject 92% of structurally invalid candidate paths, confirming preserved evaluative competence. Using SearchEval, a diagnostic lens that triangulates behavioral traces, representational geometry, and attention dynamics, we localize the bottleneck to attention-based control mechanisms that progressively decouple from task-relevant state during generation. 
We then validate this localization with three interventions: attention patching from successful early steps into failing later steps raises valid-transition rates from 21% to 47%; zero-ablation of the top 5% of state-attending heads drops short-horizon valid-transition rates from 78% to 31%; and adding the path-membership function vector lifts valid-transition rates by 12 percentage points. Taken together, the observational and interventional evidence is consistent with an account in which control instability, rather than representational inadequacy, is the binding constraint. These findings suggest that architectural innovations targeting state persistence, and not merely scaling, may be necessary for reliable algorithmic reasoning.
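The two behavioral metrics the abstract relies on, valid prefixes and valid-transition rates, can be illustrated with a minimal sketch. It assumes a graph stored as an adjacency dictionary and a model output already parsed into a list of node names. The helper names `valid_prefix_length` and `valid_transition_rate` are hypothetical and are not taken from the paper or from SearchEval, and the paper's exact metric definitions may differ.

```python
from typing import Dict, List, Set

# Assumed representation: node name -> set of directly reachable neighbors.
Graph = Dict[str, Set[str]]


def valid_prefix_length(graph: Graph, path: List[str]) -> int:
    """Count consecutive valid transitions from the start of the path.

    Counting stops at the first step that does not follow an edge.
    """
    count = 0
    for src, dst in zip(path, path[1:]):
        if dst not in graph.get(src, set()):
            break
        count += 1
    return count


def valid_transition_rate(graph: Graph, paths: List[List[str]]) -> float:
    """Fraction of all emitted transitions that follow an edge in the graph."""
    total = 0
    valid = 0
    for path in paths:
        for src, dst in zip(path, path[1:]):
            total += 1
            if dst in graph.get(src, set()):
                valid += 1
    return valid / total if total else 0.0


# Toy directed chain A -> B -> C -> D (illustrative data, not from the paper).
toy_graph: Graph = {"A": {"B"}, "B": {"C"}, "C": {"D"}, "D": set()}
good_path = ["A", "B", "C", "D"]   # every step follows an edge
drift_path = ["A", "B", "D", "C"]  # valid first step, then two invalid steps

prefix_good = valid_prefix_length(toy_graph, good_path)
prefix_drift = valid_prefix_length(toy_graph, drift_path)
rate = valid_transition_rate(toy_graph, [good_path, drift_path])
```

Under these assumptions, a trial would count as producing a valid three-step prefix when `valid_prefix_length` returns at least 3, even if later transitions are invalid.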
Submission Type: Long submission (more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=nsyGO59QWR&noteId=nsyGO59QWR
Changes Since Last Submission: The previous submission contained a small number of formatting artifacts introduced during compilation, resulting in unresolved reference placeholders (“??”) in the manuscript. These have now been fully corrected. Specifically, all figure, table, and equation references have been verified to resolve correctly, and the manuscript has been recompiled to ensure there are no broken cross-references or missing artifact links. No substantive changes to the technical content, experiments, or conclusions were made.
Assigned Action Editor: ~Ali_Ramezani-Kebrya1
Submission Number: 7016