Automated Attention Pattern Discovery at Scale in Large Language Models

TMLR Paper5837 Authors

07 Sept 2025 (modified: 10 Feb 2026) · Decision pending for TMLR · CC BY 4.0
Abstract: Large language models owe their success to scaling up their capabilities to work in general settings. The same, unfortunately, cannot be said of their interpretability methods. The current trend in mechanistic interpretability is to give exact explanations of specific behaviors in clinical settings; these often do not generalize to other settings, or are too resource-intensive for larger studies. In this work we propose to study repeated behaviors in large language models by mining completion scenarios from Java code datasets, exploiting the structured nature of source code. We then collect the attention patterns generated by the attention heads and demonstrate that they are scalable signals for global interpretability of model components. We show that vision models offer a promising direction for analyzing attention patterns at scale. To demonstrate this, we introduce the Attention Pattern Masked Autoencoder (AP-MAE), a vision transformer-based model that efficiently reconstructs masked attention patterns. Experiments on StarCoder2 models (3B–15B) show that AP-MAE (i) reconstructs masked attention patterns with high accuracy, (ii) generalizes across unseen models with minimal degradation, (iii) reveals recurring patterns across a large number of inferences, (iv) predicts whether a generation will be correct without access to ground truth, with an accuracy of 55% to 70% depending on the task, and (v) enables targeted interventions that increase accuracy by 13.6% when applied selectively, but cause rapid collapse when applied excessively. These results establish attention patterns as a scalable signal for interpretability and demonstrate that AP-MAE provides a transferable foundation for both analysis and intervention in large language models. Beyond its standalone value, AP-MAE can also serve as a selection procedure to guide more fine-grained mechanistic approaches toward the most relevant components.
We release code and models to support future work in large-scale interpretability.
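To make the AP-MAE input pipeline concrete, the following is a minimal sketch of how an attention pattern can be treated as an image for a masked autoencoder: a toy causal attention map is split into non-overlapping patches, and a random subset of patches is masked out, leaving only the visible patches for the encoder to reconstruct from. The patch size, masking ratio, and helper names here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def toy_attention_pattern(seq_len, rng):
    # Toy stand-in for a head's attention map: softmax over
    # lower-triangular (causal) scores, so each row sums to 1.
    scores = rng.normal(size=(seq_len, seq_len))
    causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    scores = np.where(causal, scores, -np.inf)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def patchify(attn, patch):
    # Split an (S, S) attention map into (S/patch)^2 flat patches,
    # as a ViT-style patch embedding would before projection.
    n = attn.shape[0] // patch
    return (attn.reshape(n, patch, n, patch)
                .transpose(0, 2, 1, 3)
                .reshape(n * n, patch * patch))

def random_mask(patches, mask_ratio, rng):
    # MAE-style random masking: keep only (1 - mask_ratio) of the
    # patches; the decoder would later reconstruct the rest.
    num = patches.shape[0]
    keep = int(num * (1 - mask_ratio))
    perm = rng.permutation(num)
    visible_idx = np.sort(perm[:keep])
    return patches[visible_idx], visible_idx

rng = np.random.default_rng(0)
attn = toy_attention_pattern(32, rng)        # one head, 32 tokens
patches = patchify(attn, 8)                  # 16 patches of 8x8
visible, idx = random_mask(patches, 0.75, rng)
print(attn.shape, patches.shape, visible.shape)
```

With a 75% masking ratio, only 4 of the 16 patches reach the encoder, which is what makes MAE-style pretraining cheap relative to encoding the full map.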
Submission Length: Regular submission (no more than 12 pages of main content)
Code: https://github.com/AISE-TUDelft/AP-MAE
Assigned Action Editor: ~Vlad_Niculae2
Submission Number: 5837