Keywords: Deep Learning, Transformers, Explainable AI, Neuroscience, Large Language Models
Abstract: Transformer models offer strong predictive performance but generally lack interpretability, limiting their adoption in high-stakes applications such as neuroscience. Existing explainable AI methods tend to produce inconsistent and biologically ungrounded explanations, reducing their usefulness in uncovering condition-specific mechanisms. In this paper, we propose an Interpretability-Guided Alignment module designed to enhance the explainability of pre-trained Transformer models by aligning their internal representations and weights with established external biological knowledge. We introduce a novel conditional interpretable layer and a block-wise interpretability mechanism that provide localized, human-understandable insights into model decisions. Experimental evaluation on two real-world Alzheimer’s disease datasets, Seattle and ROSMAP, demonstrates that our approach not only achieves strong classification accuracy but also uncovers biologically meaningful interpretations, identifying key pathways supported by external biological databases such as KEGG and WikiPathways and thereby outperforming existing baselines. Specifically, our solution achieves biological interpretability scores for the Alzheimer’s disease condition that are more than three times higher than those of existing methods. Furthermore, our approach has the potential to enhance the interpretability of other Transformer-based models across application domains when integrated with relevant external knowledge.
Primary Area: applications to neuroscience & cognitive science
Submission Number: 14680