Keywords: Deep Learning, Transformers, Explainable AI, Neuroscience, Large Language Models
Abstract: Transformer models offer strong predictive performance but generally lack interpretability, limiting their adoption in high-stakes applications such as neuroscience. Existing explainable AI methods tend to produce inconsistent and biologically ungrounded explanations, reducing their usefulness in uncovering condition-specific mechanisms. In this paper, we propose an Interpretability-Guided Alignment module designed to enhance the explainability of pre-trained Transformer models by aligning their internal representations and weights with established external biological knowledge. We introduce a novel conditional interpretable layer and a block-wise interpretability mechanism that provide localized, human-understandable insights into model decisions. Experimental evaluation on two real-world Alzheimer’s disease datasets, Seattle and ROSMAP, demonstrates that our approach not only achieves strong classification accuracy but also uncovers biologically meaningful interpretations, identifying key pathways supported by external biological databases such as KEGG and WikiPathways and thereby outperforming existing baselines. Specifically, our solution achieves biological interpretability scores for the Alzheimer’s disease condition that are more than three times higher than those of existing methods. Furthermore, our approach has the potential to enhance the interpretability of other Transformer-based models across application domains when integrated with relevant external knowledge.
Primary Area: applications to neuroscience & cognitive science
Submission Number: 14680