Mechanistic Interpretability of LLMs through Network Science

20 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Mechanistic interpretability, Large language models, Network science, Graph theory, Emergent abilities
Abstract: Understanding scaling laws and emergent abilities in Large Language Models (LLMs) remains a key challenge for interpretability. While much prior work in mechanistic interpretability has focused on learned representations, the attention matrix, which governs information flow, has received far less scrutiny. Furthermore, the attention matrix has not yet been analysed from a theoretical network science perspective. In this work, we present a pipeline for dynamic graph construction from attention matrices, introduce a novel head aggregation technique based on entropy, and analyse the resulting attention graphs from a network science perspective to draw interpretability insights. Our experiments show that entropy-based head aggregation preserves attention details, and that key graph metrics, specifically the clustering coefficient and maximum PageRank, correlate with improved model correctness and emergent abilities in LLMs. Notably, our findings indicate that larger models exhibit higher maximum PageRank and lower clustering coefficients, suggesting they reason differently by attending more globally and selectively focusing on key hotspots.
Primary Area: interpretability and explainable AI
Submission Number: 23600
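
To make the abstract's pipeline concrete, the following is a minimal sketch, not the authors' exact method, of turning a multi-head attention tensor into a directed graph and computing the two metrics highlighted above (clustering coefficient and maximum PageRank) with networkx. The inverse-entropy head weighting and the edge threshold are illustrative assumptions, not details taken from the paper.

```python
import numpy as np
import networkx as nx


def entropy_aggregate_heads(attn, eps=1e-12):
    """Aggregate a (heads, seq, seq) attention tensor into one (seq, seq) matrix.

    Assumed scheme: each head is weighted by the inverse of its mean row entropy,
    so heads with sharper (lower-entropy) attention contribute more.
    """
    n_heads = attn.shape[0]
    weights = np.empty(n_heads)
    for h in range(n_heads):
        rows = attn[h] + eps
        row_entropy = -(rows * np.log(rows)).sum(axis=-1)  # entropy per query position
        weights[h] = 1.0 / (row_entropy.mean() + eps)      # sharper heads weigh more
    weights /= weights.sum()
    return np.tensordot(weights, attn, axes=1)             # weighted sum over heads


def attention_graph(agg_attn, threshold=0.05):
    """Build a directed graph whose edges are aggregated attention scores above a
    small threshold (the threshold value is an illustrative choice)."""
    G = nx.DiGraph()
    seq_len = agg_attn.shape[0]
    G.add_nodes_from(range(seq_len))
    for i in range(seq_len):
        for j in range(seq_len):
            if agg_attn[i, j] > threshold:
                G.add_edge(i, j, weight=float(agg_attn[i, j]))
    return G


if __name__ == "__main__":
    # Synthetic attention for demonstration: softmax over random logits.
    rng = np.random.default_rng(0)
    heads, seq_len = 8, 16
    logits = rng.normal(size=(heads, seq_len, seq_len))
    attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

    agg = entropy_aggregate_heads(attn)
    G = attention_graph(agg)
    clustering = nx.average_clustering(G.to_undirected(), weight="weight")
    max_pagerank = max(nx.pagerank(G, weight="weight").values())
    print(f"clustering coefficient: {clustering:.3f}, max PageRank: {max_pagerank:.3f}")
```

In practice one such graph would be built per layer (and per input), and the metrics tracked across model scale to study the correlations described in the abstract.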