Causal Sparse Concepts for Faithful Explanations of Large Models

Published: 24 Apr 2026, Last Modified: 24 Apr 2026 · CauScale 2026 · CC BY 4.0
Keywords: Counterfactual explanations, Faithful Explanations, Black-box Models, Sparse Autoencoder
TL;DR: This paper proposes a sparse causal framework for faithful explanations of black-box time-series models
Abstract: As large pretrained models are increasingly deployed in high-stakes settings, faithful explanations of their predictions are essential for understanding and verification. Existing post-hoc methods often lack causal grounding and degrade under distribution shift, limiting their reliability for black-box models whose training data and internal representations are unknown. We introduce TimeSAE, a framework for learning sparse, causally grounded concept explanations for sequential models. TimeSAE builds on a Sparse Autoencoder with JumpReLU activations to learn an interpretable dictionary of temporal concepts and applies counterfactual interventions to estimate their causal influence on model predictions. Experiments on eight datasets and large pretrained models demonstrate consistent improvements over eight baselines, with stronger gains under distribution shift. Our code and datasets are available at: https://anonymous.4open.science/w/TimeSAE-571D/.
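The abstract describes two components: a Sparse Autoencoder with JumpReLU activations that learns a concept dictionary over hidden states, and counterfactual interventions that ablate a concept to estimate its causal influence on the black-box model's prediction. The sketch below is a minimal, illustrative rendering of that pipeline, not the paper's implementation; all names (`SparseAutoencoder`, `causal_effect`, the threshold `theta`) and the toy dimensions are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def jumprelu(z, theta=0.5):
    # JumpReLU: pass an activation through only when it exceeds the threshold,
    # yielding sparse concept activations (zero below theta, identity above).
    return z * (z > theta)

class SparseAutoencoder:
    """Toy concept dictionary over hidden states (illustrative only)."""
    def __init__(self, d_model, d_dict):
        # Random weights stand in for a trained dictionary.
        self.W_enc = rng.normal(0.0, 0.1, (d_model, d_dict))
        self.W_dec = rng.normal(0.0, 0.1, (d_dict, d_model))
        self.b_enc = np.zeros(d_dict)

    def encode(self, x):
        # Map hidden states to sparse concept activations.
        return jumprelu(x @ self.W_enc + self.b_enc)

    def decode(self, h):
        # Reconstruct hidden states from concept activations.
        return h @ self.W_dec

def causal_effect(predict, x, sae, concept_idx):
    """Estimate one concept's causal influence: ablate it in the sparse
    code, decode, and measure the shift in the black-box prediction."""
    h = sae.encode(x)
    baseline = predict(sae.decode(h))
    h_cf = h.copy()
    h_cf[..., concept_idx] = 0.0  # counterfactual intervention
    counterfactual = predict(sae.decode(h_cf))
    return np.abs(baseline - counterfactual).mean()
```

Concepts can then be ranked by their estimated effect, keeping only the few whose ablation meaningfully changes the prediction, which is one way to read the "sparse, causally grounded" framing above.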
Submission Number: 31