Evaluating the Utility of Sparse Autoencoders for Interpreting a Pathology Foundation Model

Published: 09 Oct 2025, Last Modified: 09 Oct 2025 · NeurIPS 2025 Workshop Imageomics · CC BY 4.0
Submission Track: Full papers that have been published at a peer-reviewed venue after January 1st, 2024 (up to 9 pages, excluding references)
Keywords: sparse autoencoders, pathology, medical imaging
TL;DR: We demonstrate that sparse autoencoders can be used to discover monosemantic, biologically relevant representations from a pathology foundation model.
Abstract: Pathology plays an important role in disease diagnosis, treatment decision-making, and drug development. Prior work on interpretability for machine learning models on pathology images has centered on methods such as attention value visualization and deriving human-interpretable features from model heatmaps. Mechanistic interpretability is an emerging area of model interpretability that focuses on reverse-engineering neural networks. Sparse Autoencoders (SAEs) have strong potential for extracting monosemantic concepts from polysemantic model activations. In this work, we train a Sparse Autoencoder on the embeddings of a pathology-pretrained foundation model. We find that Sparse Autoencoder features represent interpretable and monosemantic biological concepts. In particular, individual SAE dimensions show strong correlations with the counts of individual cell types, such as plasma cells and lymphocytes. These biological representations are unique to the pathology-pretrained model and are not found in a self-supervised model pretrained on natural images. These biologically grounded monosemantic representations evolve across the model’s depth, and the pathology foundation model eventually gains robustness to non-biological factors, such as scanner type. The emergence of these biologically relevant SAE features generalizes to an out-of-domain dataset. Finally, we highlight certain limitations of SAEs and why more work is needed to achieve complete monosemanticity. Our work paves the way for further exploration of interpretable feature dimensions and their utility for medical and clinical applications.
Submission Number: 10
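
For readers unfamiliar with the setup the abstract describes, the sketch below illustrates one common way to train a sparse autoencoder on frozen foundation-model embeddings and then probe a single SAE dimension for correlation with a biological covariate. It is not the authors' implementation: the architecture (a single-layer ReLU SAE with an L1 sparsity penalty), all dimensions, the hyperparameters, and the random stand-in data are assumptions for illustration only.

```python
# Minimal sparse autoencoder sketch (illustrative; not the paper's code).
# Assumes precomputed patch embeddings of shape (num_patches, d_model);
# all sizes and hyperparameters below are hypothetical.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))  # sparse feature activations
        x_hat = self.decoder(f)          # reconstruction of the embedding
        return x_hat, f


d_model, d_hidden = 768, 8192            # hypothetical expansion factor of ~10x
sae = SparseAutoencoder(d_model, d_hidden)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3                          # sparsity strength (hypothetical)

embeddings = torch.randn(10_000, d_model)  # stand-in for real model embeddings
for batch in embeddings.split(256):
    x_hat, f = sae(batch)
    # Reconstruction error plus L1 penalty on activations encourages
    # each input to be explained by few active SAE dimensions.
    loss = ((x_hat - batch) ** 2).mean() + l1_coeff * f.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Probe monosemanticity: correlate one SAE dimension's activations with
# per-patch cell counts (random stand-ins here).
cell_counts = torch.randint(0, 50, (10_000,)).float()
with torch.no_grad():
    _, feats = sae(embeddings)
acts = feats[:, 0]  # activations of SAE dimension 0 across all patches
r = torch.corrcoef(torch.stack([acts, cell_counts]))[0, 1]
print(f"Pearson r between SAE dim 0 and cell counts: {r:.3f}")
```

In the paper's setting, the random tensors would be replaced by real patch embeddings from the pathology foundation model and measured cell-type counts; a dimension whose activations track, say, plasma-cell or lymphocyte counts would be a candidate monosemantic biological feature.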