Denoising Concept Vectors with Sparse Autoencoders for Improved Language Model Steering

ACL ARR 2025 July Submission 974 Authors

29 Jul 2025 (modified: 04 Sept 2025), ACL ARR 2025 July Submission, CC BY 4.0
Abstract: Linear concept vectors effectively steer LLMs, but existing methods suffer from noisy features in diverse datasets that undermine steering robustness. We propose Sparse Autoencoder-Denoised Concept Vectors (SDCV), which selectively keep the most discriminative SAE latents while reconstructing hidden representations. Our key insight is that concept-relevant signals can be explicitly separated from dataset noise by scaling up the activations of the top-k latents that best differentiate positive and negative samples. Applied to linear probing and difference-in-means methods, SDCV consistently improves steering success rates by 4–16% across six challenging concepts, while maintaining topic relevance.
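A minimal sketch of the workflow the abstract describes, assuming a standard ReLU SAE and an absolute mean-difference criterion for ranking discriminative latents; the dimensions, scaling factor, selection rule, and placeholder weights below are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

# Hypothetical dimensions and hyperparameters; not specified in the abstract.
d_model, d_sae, k, scale = 768, 8192, 32, 4.0
rng = np.random.default_rng(0)

# Placeholder SAE weights; in practice these come from a pretrained SAE.
W_enc = rng.standard_normal((d_model, d_sae)) / np.sqrt(d_model)
W_dec = rng.standard_normal((d_sae, d_model)) / np.sqrt(d_sae)

def sae_encode(h):
    # ReLU encoder: hidden states -> sparse latent activations.
    return np.maximum(h @ W_enc, 0.0)

def sae_decode(z):
    # Linear decoder: latent activations -> reconstructed hidden states.
    return z @ W_dec

def denoise(h_pos, h_neg):
    """Reconstruct hidden states while scaling up the top-k latents that
    best separate positive from negative samples (here: absolute
    difference of mean activations, an assumed criterion)."""
    z_pos, z_neg = sae_encode(h_pos), sae_encode(h_neg)
    disc = np.abs(z_pos.mean(0) - z_neg.mean(0))   # per-latent discriminativeness
    top_k = np.argsort(disc)[-k:]                  # most concept-relevant latents
    gain = np.ones(d_sae)
    gain[top_k] = scale                            # amplify only those latents
    return sae_decode(z_pos * gain), sae_decode(z_neg * gain)

# Toy usage: a difference-in-means concept vector from denoised representations.
h_pos = rng.standard_normal((100, d_model))        # placeholder "positive" activations
h_neg = rng.standard_normal((100, d_model))        # placeholder "negative" activations
den_pos, den_neg = denoise(h_pos, h_neg)
concept_vector = den_pos.mean(0) - den_neg.mean(0)
```

The same denoised representations could instead be fed to a linear probe, whose weight vector would then serve as the steering direction.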
Paper Type: Short
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: hierarchical & concept explanations, model editing, probing
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 974