C$^{2}$R: Cross-sample Consistency Regularization Mitigates Feature Splitting and Absorption in Sparse Autoencoders
Abstract: Sparse Autoencoders (SAEs) are widely used to interpret large language models by decomposing activations into sparse, human-understandable features, but scaling to large dictionaries exposes fundamental challenges. Systematic studies reveal pervasive feature splitting, which fragments coherent concepts into non-atomic latents, and widespread feature absorption, which creates arbitrary exceptions in general features; both severely compromise latent reliability. These issues stem from inconsistent latent assignment across samples: without cross-sample constraints, per-sample optimization often allows a single underlying concept to be distributed inconsistently across multiple redundant or interfering latents. To address this, we introduce C$^2$R (Cross-sample Consistency Regularization). C$^2$R encourages each semantic feature to be represented consistently by a single latent across the batch by penalizing the co-activation of directionally similar latents. Comprehensive evaluation demonstrates that C$^2$R mitigates both splitting and absorption while preserving reconstruction fidelity, providing a principled way to improve latent interpretability without degrading model performance. Source code is available at https://github.com/hr-jin/Cross-sample-Consistency-Regularization.
Lay Summary: Sparse autoencoders (SAEs) are used to interpret large language models by decomposing their internal activations into sparse, human-understandable features. Ideally, each learned feature corresponds to a single coherent concept. In practice, scaling SAEs to large dictionaries introduces two failure modes: feature splitting, where one concept is fragmented across multiple redundant features, and feature absorption, where a general feature develops arbitrary exceptions by absorbing unrelated information. Both problems reduce the reliability of learned features for downstream analysis. We introduce C$^2$R, a regularization method that penalizes inconsistent feature assignment across samples in a training batch. C$^2$R encourages each concept to be captured by a single unified feature rather than distributed across redundant ones. Experiments on Gemma-2-2B, Qwen3-8B, and Llama-3-8B show that C$^2$R reduces splitting and absorption while preserving reconstruction fidelity and interpretability. The method is architecture-agnostic and can be applied on top of existing SAE variants. These results improve the reliability of mechanistic interpretability tools for understanding model internals.
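The phrase "penalizing the co-activation of directionally similar latents" can be made concrete with a small sketch. The code below is not the authors' loss, whose exact form is not given here. It is one possible reading, and it assumes that similarity means non-negative cosine similarity between decoder directions and that co-activation means the product of batch-averaged activations. The names `cosine`, `consistency_penalty`, `acts`, and `decoder` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two decoder directions (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def consistency_penalty(acts, decoder):
    """Hypothetical cross-sample penalty (an assumption, not the paper's formula).

    acts:    batch x n_latents list of non-negative latent activations
    decoder: n_latents x d list of decoder directions
    """
    n_latents = len(decoder)
    batch_size = len(acts)
    # Batch-averaged activation per latent: this is what makes the term
    # cross-sample, since two latents can be penalized together even when
    # they never fire on the same sample.
    mean_act = [sum(row[i] for row in acts) / batch_size for i in range(n_latents)]
    penalty = 0.0
    for i in range(n_latents):
        for j in range(i + 1, n_latents):
            # Only directionally similar pairs (positive cosine) contribute.
            sim = max(cosine(decoder[i], decoder[j]), 0.0)
            penalty += sim * mean_act[i] * mean_act[j]
    return penalty


# Toy dictionary: latents 0 and 1 point in similar directions, latent 2 does not.
decoder = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]

# "Split" batch: the same concept is assigned to latent 0 in one sample
# and to latent 1 in the other.
split_acts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
# "Unified" batch: both samples use latent 0.
unified_acts = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]

split_penalty = consistency_penalty(split_acts, decoder)
unified_penalty = consistency_penalty(unified_acts, decoder)
print(split_penalty > unified_penalty)
```

Under these assumptions, the split assignment is penalized while the unified one is not. In training, a term of this kind would be added to the usual reconstruction and sparsity objectives with a weighting coefficient.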
Link To Code: https://github.com/hr-jin/Cross-sample-Consistency-Regularization
Primary Area: Social Aspects->Accountability, Transparency, and Interpretability
Keywords: sparse autoencoders, natural language explanations, concept explanations
Originally Submitted PDF: pdf
Submission Number: 34488