Keywords: Explainable AI, Transformer models, Neural interpretability, Co-activation graphs, Knowledge extraction
Abstract: Like other deep neural network models, Transformers often function as "black boxes" despite their exceptional capabilities in Natural Language Processing (NLP) and Vision tasks: their internal processes remain largely opaque due to their complex architectures. This work extends graph-based knowledge extraction techniques, previously applied to CNNs, to the domain of Transformer models.
We explore the inner mechanics of Transformer models by constructing a co-activation graph from their encoder layers. The nodes of the graph represent the hidden units within each encoder layer, while the edges represent statistical correlations between these hidden units. The magnitude of co-activation, i.e., the correlation between the activations of two hidden units, determines the strength of their connection in the graph.
Our research focuses on encoder-only Transformer classifiers. We conduct experiments on a custom-built Transformer and a pre-trained BERT model for an NLP task, using graph analysis to detect clusters of semantically related classes and to assess their impact on misclassification patterns. We demonstrate a positive correlation between class similarity and the frequency of classification errors. Our findings suggest that co-activation graphs reveal structured, interpretable representations in Transformers, consistent with prior findings on knowledge extraction from CNNs.
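For illustration, a minimal sketch of how a co-activation graph of the kind described above could be built from hidden-unit activations; this is not the submission's code, and the helper name, the correlation-and-threshold construction, and the library choices (NumPy, NetworkX) are assumptions.

```python
# Illustrative co-activation graph construction (a sketch, not the paper's implementation).
# Assumes `activations` has shape (num_samples, num_hidden_units), collected from a
# Transformer encoder layer while running the model over a dataset.
import numpy as np
import networkx as nx

def build_coactivation_graph(activations: np.ndarray, threshold: float = 0.3) -> nx.Graph:
    """Nodes are hidden units; edge weights are pairwise activation correlations."""
    corr = np.corrcoef(activations, rowvar=False)  # (units x units) correlation matrix
    num_units = corr.shape[0]
    graph = nx.Graph()
    graph.add_nodes_from(range(num_units))
    for i in range(num_units):
        for j in range(i + 1, num_units):
            weight = abs(corr[i, j])  # magnitude of co-activation
            if weight >= threshold:   # keep only strongly co-activated pairs
                graph.add_edge(i, j, weight=weight)
    return graph

# Hypothetical usage with random activations standing in for real encoder outputs.
graph = build_coactivation_graph(np.random.rand(1000, 64))

# Communities of co-activated units can then be found with, e.g., the Louvain
# algorithm cited in the changes list below.
communities = nx.community.louvain_communities(graph, weight="weight", seed=0)
```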
Track: Neurosymbolic Methods for Trustworthy and Interpretable AI
Paper Type: Long Paper
Resubmission: Yes
Changes List: Limitations Section Added: We have included a brief but focused Limitations section outlining the current constraints of our method, such as the scope of datasets, the sensitivity of graph construction, and computational overhead.
Dataset Citations Added: Proper references and links for the 20-Newsgroups and CIFAR datasets have been added to improve reproducibility and attribution.
Typos Fixed: Frequent formatting and spacing issues (e.g., missing spaces after periods) have been corrected throughout the paper for better readability.
Sentence Rephrasing: Several reviewer-suggested sentences were revised for clarity and precision—particularly around pruning, class similarity, and the core contributions in the introduction.
Mathematical Formula Corrections: Fixed the weighted Jaccard coefficient formula in Section 4.2.1, changing Σ_{x∈n(a)} x to Σ_{x∈n(a)} w(x,a), as identified by Reviewer keLN (an illustrative form of the coefficient is sketched after this list).
Community Detection Algorithm Citations: Added proper citations for Louvain (Blondel et al., 2008) and Leiden (Traag et al., 2019) algorithms.
Recent Related Work Added: We incorporated 2025 papers on mechanistic interpretability and causal graph extraction from Transformers, including Marks et al.'s work on sparse feature circuits for discovering interpretable causal subnetworks in language models, to reflect the latest developments in the field.
Page Limit Constraints: Some additional suggestions—such as deeper architectural comparisons, more ablation results, and expanded visualization—were considered but not fully implemented due to page limits. These will be addressed in a forthcoming extended version currently under development.
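For reference, one common form of the weighted Jaccard coefficient that is consistent with the corrected denominator term Σ_{x∈n(a)} w(x,a); the numerator shown here is an assumption, since the exact expression appears in Section 4.2.1 of the paper.

```latex
% Weighted Jaccard coefficient between nodes a and b (one common formulation);
% n(a) is the neighbourhood of a and w(x,a) the weight of the edge between x and a.
J(a,b) = \frac{\sum_{x \in n(a) \cap n(b)} \bigl( w(x,a) + w(x,b) \bigr)}
              {\sum_{x \in n(a)} w(x,a) \; + \; \sum_{x \in n(b)} w(x,b)}
```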
Publication Agreement: pdf
Submission Number: 89