Probing Clinical Concepts in an EHR Foundation Model via Sparse Autoencoders

Published: 23 May 2026, Last Modified: 23 May 2026
Venue: SD4H ICML 2026
License: CC BY 4.0
Keywords: foundation model, Electronic health records
TL;DR: Sparse autoencoders applied to an EHR foundation model recover candidate clinical syndromes as monosemantic features and reveal cross-layer circuits validated by activation patching.
Abstract: Foundation models (FMs) trained on large electronic health record (EHR) datasets can predict patient outcomes, but it is difficult to know what medical knowledge they have acquired. Unlike chatbot LLMs, EHR-FMs are being considered for high-stakes clinical deployment, making it especially important to audit what they have learned beyond predictive accuracy. We apply sparse autoencoders (SAEs) to a transformer-based FM trained on the MIMIC-IV dataset, extending SAE-based mechanistic interpretability to FMs trained on clinical event streams. We use LLM-based interpretation to characterize learned features, revealing that EHR models learn a clinical ontology distinct from the International Classification of Diseases (ICD) system. We show that learned features are organized by prevalence and that the model encodes candidate matches to known clinical syndromes as single monosemantic features. Syndromic features are composed from lower-level features through cross-layer information-flow circuits that we probe via activation patching. We validate the learned features along two axes: external validity, where feature activations align with held-out ICD phenotypes, and interventional consistency, where activation patching produces measurable downstream effects in source-target pairs. Together, these results demonstrate the utility of SAEs as an interpretive layer for EHR foundation models.
Submission Number: 153