Toward Monosemantic Clinical Explanations for Alzheimer’s Diagnosis via Attribution and Mechanistic Interpretability
Keywords: mechanistic-interpretability, attributional-interpretability, neurodegenerative-disease, LLM, sparse-autoencoders, explanation-optimizer
Abstract: Interpretability remains a central barrier to the safe deployment of large language models (LLMs) in high-stakes domains such as neurodegenerative disease diagnosis. In Alzheimer’s disease (AD), early and explainable predictions are critical for clinical decision-making, yet attribution-based methods (e.g., saliency maps, SHAP) often suffer from inconsistency due to the polysemantic nature of LLM representations. Mechanistic interpretability promises to uncover more coherent features, but it is not directly aligned with individual model outputs, limiting its applicability in practice.
To address these limitations, we propose a unified interpretability framework that integrates attributional and mechanistic perspectives via monosemantic feature extraction. First, we evaluate six common attribution techniques and develop an explanation-optimization step that refines explanations to reduce inter-method variability and improve clarity. Second, we train sparse autoencoders (SAEs) to map LLM activations into a disentangled latent space in which each dimension corresponds to a coherent semantic concept. This monosemantic representation enables more structured and interpretable attribution analysis. We then compare feature attributions in this latent space with those computed on the original model, demonstrating improved robustness and semantic clarity.
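The abstract does not specify the SAE architecture; as an illustration of the second stage, the sketch below trains a standard overcomplete sparse autoencoder (ReLU latents with an L1 penalty) on LLM activations, a common recipe for monosemantic feature extraction. All dimensions, hyperparameters, and the placeholder activation batches are illustrative assumptions, not the authors' exact setup.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal sparse autoencoder over LLM activations (hypothetical setup).

    An overcomplete ReLU latent layer with an L1 sparsity penalty encourages
    each latent dimension to fire for a single coherent concept
    (monosemanticity). Dimensions here are illustrative only.
    """

    def __init__(self, d_model: int = 4096, d_latent: int = 16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def forward(self, acts: torch.Tensor):
        z = torch.relu(self.encoder(acts))   # sparse, non-negative latent codes
        recon = self.decoder(z)              # reconstruction of the activations
        return recon, z


def sae_loss(recon, acts, z, l1_coeff: float = 1e-3):
    # Reconstruction fidelity plus L1 sparsity on the latent codes.
    mse = torch.mean((recon - acts) ** 2)
    sparsity = l1_coeff * z.abs().mean()
    return mse + sparsity


# Placeholder data: in practice, activations would be collected by hooking a
# chosen LLM layer while it processes the Alzheimer's cohort inputs.
activation_batches = [torch.randn(64, 4096) for _ in range(10)]

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
for acts in activation_batches:
    recon, z = sae(acts)
    loss = sae_loss(recon, acts, z)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Once trained, attribution methods (e.g., SHAP or gradient-times-input) can be applied to the latent codes `z` rather than to raw activations, which is the sense in which the latent space supports the attribution comparison described above.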
Evaluations on in-distribution (ID) and out-of-distribution (OOD) Alzheimer's cohorts, across binary and three-class classification tasks, confirm the effectiveness of our framework. By bridging attributional relevance and mechanistic clarity, our approach provides more trustworthy, consistent, and human-aligned explanations, and reveals clinically meaningful patterns in multimodal AD data. This work takes a step toward safer and more reliable integration of LLMs into cognitive health applications and clinical workflows.
Supplementary Material: pdf
Primary Area: interpretability and explainable AI
Submission Number: 22121