Do Sparse Autoencoders Reveal Faithful Concepts? A Neuron-Level Cross-Validation with QA-Based Probes
Keywords: Neuron interpretability, Sparse autoencoders, Large language models, Faithfulness evaluation, Semantic probes, Model interpretability, Representation analysis
Abstract: Sparse Autoencoders (SAEs) have recently been adopted to interpret large language models by decomposing hidden activations into sparse, human-interpretable “concept features.” These features are often assumed to provide faithful semantic explanations of individual neurons. However, despite their growing use, it remains unclear whether SAE-discovered concepts genuinely reflect neuron semantics or instead arise from reconstruction-induced artifacts.
We propose a neuron-level cross-validation framework that evaluates SAE-based interpretations using an independent semantic signal. Specifically, we leverage question–answering embeddings (QA-Emb), in which each dimension corresponds to an explicit yes/no semantic query posed to a language model, as an external probe that is independent of SAE training objectives. For each neuron, we derive parallel interpretations from SAE features and QA-based probes, and quantify their agreement using semantic similarity and sign-consistency criteria.
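For intuition, a minimal sketch of how such a neuron-level agreement check could be implemented is given below. The text encoder, the variable names (`sae_feature_text`, `qa_questions`, `qa_emb`, `neuron_acts`), and the similarity threshold are illustrative assumptions made for this sketch, not the paper's actual implementation.

```python
# Minimal sketch (illustrative, not the paper's code): cross-validating one
# neuron's SAE-derived interpretation against QA-based probes via
# (i) semantic similarity of descriptions and (ii) sign consistency of activations.
import numpy as np
from sentence_transformers import SentenceTransformer  # any text encoder would do

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder choice

def semantic_similarity(sae_feature_text: str, qa_question: str) -> float:
    """Cosine similarity between the SAE feature description and a yes/no QA query."""
    a, b = encoder.encode([sae_feature_text, qa_question])
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def sign_consistent(neuron_acts: np.ndarray, qa_dim: np.ndarray) -> bool:
    """True if the neuron and the QA dimension co-vary with the same sign
    across a shared set of input texts."""
    return np.corrcoef(neuron_acts, qa_dim)[0, 1] > 0

def agreement(sae_feature_text, qa_questions, neuron_acts, qa_emb, sim_thresh=0.5):
    """Return the QA questions that both semantically match the SAE interpretation
    and are sign-consistent with the neuron's activations (threshold is assumed)."""
    matches = []
    for j, q in enumerate(qa_questions):
        if (semantic_similarity(sae_feature_text, q) >= sim_thresh
                and sign_consistent(neuron_acts, qa_emb[:, j])):
            matches.append(q)
    return matches
```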
Across experiments on large-scale news data and multiple auxiliary domains, we find that while many neurons exhibit partial or strong agreement between SAE and QA interpretations, a substantial fraction shows systematic divergence. Through qualitative analysis and controlled diagnostics, we identify recurring failure modes in which SAE features appear semantically coherent yet lack support from independent QA evidence. These discrepancies highlight cases where apparent interpretability does not imply faithfulness.
Our results demonstrate that independent semantic cross-validation is essential for assessing the reliability of reconstruction-based neuron interpretations, and our framework provides practical diagnostic tools for evaluating faithfulness in large language models.
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: interpretability, model analysis, probing methods, representation analysis, faithfulness
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: English, Chinese
Submission Number: 7768