Do Sparse Autoencoders Reveal Faithful Concepts? A Neuron-Level Cross-Validation with QA-Based Probes
Keywords: Neuron interpretability, Sparse autoencoders, Large language models, Faithfulness evaluation, Semantic probes, Model interpretability, Representation analysis
Abstract: Sparse Autoencoders (SAEs) have recently been adopted to interpret large language models by decomposing hidden activations into sparse, human-interpretable “concept features.” These features are often assumed to provide faithful semantic explanations of individual neurons. However, despite their growing use, it remains unclear whether SAE-discovered concepts genuinely reflect neuron semantics or instead arise from reconstruction-induced artifacts.
We propose a neuron-level cross-validation framework that evaluates SAE-based interpretations using an independent semantic signal. Specifically, we leverage question–answering embeddings (QA-Emb), in which each dimension corresponds to an explicit yes/no semantic query posed to a language model, as an external probe that is independent of SAE training objectives. For each neuron, we derive parallel interpretations from SAE features and QA-based probes, and quantify their agreement using semantic similarity and sign-consistency criteria.
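For intuition, a minimal sketch of how such a neuron-level agreement check could be implemented is given below. The text encoder, the variable names (`sae_feature_text`, `qa_questions`, `qa_emb`, `neuron_acts`), and the similarity threshold are illustrative assumptions made for this sketch, not the paper's actual implementation.

```python
# Minimal sketch (illustrative, not the paper's code): cross-validating one
# neuron's SAE-derived interpretation against QA-based probes via
# (i) semantic similarity of descriptions and (ii) sign consistency of activations.
import numpy as np
from sentence_transformers import SentenceTransformer  # any text encoder would do

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder choice

def semantic_similarity(sae_feature_text: str, qa_question: str) -> float:
    """Cosine similarity between the SAE feature description and a yes/no QA query."""
    a, b = encoder.encode([sae_feature_text, qa_question])
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def sign_consistent(neuron_acts: np.ndarray, qa_dim: np.ndarray) -> bool:
    """True if the neuron and the QA dimension co-vary with the same sign
    across a shared set of input texts."""
    return np.corrcoef(neuron_acts, qa_dim)[0, 1] > 0

def agreement(sae_feature_text, qa_questions, neuron_acts, qa_emb, sim_thresh=0.5):
    """Return the QA questions that both semantically match the SAE interpretation
    and are sign-consistent with the neuron's activations (threshold is assumed)."""
    matches = []
    for j, q in enumerate(qa_questions):
        if (semantic_similarity(sae_feature_text, q) >= sim_thresh
                and sign_consistent(neuron_acts, qa_emb[:, j])):
            matches.append(q)
    return matches
```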
Across experiments on large-scale news data and multiple auxiliary domains, we find that while many neurons exhibit partial or strong agreement between SAE and QA interpretations, a substantial fraction shows systematic divergence. Through qualitative analysis and controlled diagnostics, we identify recurring failure modes in which SAE features appear semantically coherent yet lack support from independent QA evidence. These discrepancies highlight cases where apparent interpretability does not imply faithfulness.
Our results demonstrate that independent semantic cross-validation is essential for assessing the reliability of reconstruction-based neuron interpretations, and our framework provides practical diagnostic tools for evaluating faithfulness in large language models.
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: interpretability, model analysis, probing methods, representation analysis, faithfulness
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: English, Chinese
Submission Number: 7768