Toward Monosemantic Clinical Explanations for Alzheimer’s Diagnosis via Attribution and Mechanistic Interpretability
Keywords: mechanistic-interpretability, attributional-interpretability, neurodegenerative-disease, LLM, sparse-autoencoders, explanation-optimizer
Abstract: Interpretability remains a major obstacle to deploying large language models (LLMs) in high-stakes settings such as Alzheimer’s disease (AD) progression diagnosis, where early and explainable predictions are essential. Traditional attribution methods suffer from inter-method variability and often produce unstable explanations due to the polysemantic nature of LLM representations, while mechanistic interpretability lacks direct alignment with model inputs and outputs and does not provide importance scores. We propose a unified interpretability framework that integrates attributional and mechanistic perspectives through monosemantic feature extraction. Our approach evaluates six attribution techniques, refines them using a learning-based explanation optimizer, and employs sparse autoencoders (SAEs) to map LLM activations into a disentangled latent space that supports clearer and more coherent attribution analysis. Comparing latent-space and native attributions, we observe substantial gains in robustness, consistency, and semantic clarity. Experiments on in-distribution (IID) and out-of-distribution (OOD) Alzheimer’s cohorts across binary and three-class tasks demonstrate that our framework yields more reliable, clinically aligned explanations and reveals meaningful diagnostic patterns. This work advances the safe and trustworthy use of LLMs in cognitive health and neurodegenerative disease assessment.
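To illustrate the abstract's core mechanism, the following is a minimal sketch, not the authors' implementation, of a sparse autoencoder that maps LLM hidden activations into a wider, sparse latent space, followed by a gradient-times-activation attribution computed over those latent features. All names, dimensions, the classifier head, and the attribution rule are illustrative assumptions.

```python
# Minimal sketch (assumed, not the paper's code): SAE over LLM activations
# plus a simple latent-space attribution.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def forward(self, x):
        z = torch.relu(self.encoder(x))   # sparse, non-negative latent codes
        x_hat = self.decoder(z)           # reconstruction of the activation
        return z, x_hat

def sae_loss(x, x_hat, z, l1_coef: float = 1e-3):
    # Reconstruction fidelity + L1 penalty encouraging sparse (more monosemantic) features.
    return nn.functional.mse_loss(x_hat, x) + l1_coef * z.abs().mean()

def latent_attribution(sae, clf_head, activations, target_class: int):
    # Gradient x activation w.r.t. a class logit, computed in the SAE latent space.
    z, _ = sae(activations)
    z = z.detach().requires_grad_(True)
    logits = clf_head(z)                  # hypothetical diagnosis head over latent codes
    logits[:, target_class].sum().backward()
    return (z * z.grad).detach()          # per-feature importance scores

# Toy usage with assumed shapes: batch of 8 activation vectors, d_model=768, d_latent=4096.
sae = SparseAutoencoder(768, 4096)
clf_head = nn.Linear(4096, 3)             # e.g. a three-class AD progression head
acts = torch.randn(8, 768)
scores = latent_attribution(sae, clf_head, acts, target_class=1)
```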
Supplementary Material: pdf
Primary Area: interpretability and explainable AI
Submission Number: 22121