Denoising Concept Vectors with Sparse Autoencoders for Improved Language Model Steering

ACL ARR 2025 July Submission 974 Authors

29 Jul 2025 (modified: 04 Sept 2025), ACL ARR 2025 July Submission, CC BY 4.0
Abstract: Linear concept vectors effectively steer LLMs, but existing methods suffer from noisy features in diverse datasets that undermine steering robustness. We propose Sparse Autoencoder-Denoised Concept Vectors (SDCV), which selectively keep the most discriminative SAE latents while reconstructing hidden representations. Our key insight is that concept-relevant signals can be explicitly separated from dataset noise by scaling up the activations of the top-k latents that best differentiate positive and negative samples. Applied to linear probing and difference-in-means methods, SDCV consistently improves steering success rates by 4–16% across six challenging concepts, while maintaining topic relevance.
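A minimal sketch of the workflow the abstract describes, assuming a standard ReLU SAE and an absolute mean-difference criterion for ranking discriminative latents; the dimensions, scaling factor, selection rule, and placeholder weights below are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

# Hypothetical dimensions and hyperparameters; not specified in the abstract.
d_model, d_sae, k, scale = 768, 8192, 32, 4.0
rng = np.random.default_rng(0)

# Placeholder SAE weights; in practice these come from a pretrained SAE.
W_enc = rng.standard_normal((d_model, d_sae)) / np.sqrt(d_model)
W_dec = rng.standard_normal((d_sae, d_model)) / np.sqrt(d_sae)

def sae_encode(h):
    # ReLU encoder: hidden states -> sparse latent activations.
    return np.maximum(h @ W_enc, 0.0)

def sae_decode(z):
    # Linear decoder: latent activations -> reconstructed hidden states.
    return z @ W_dec

def denoise(h_pos, h_neg):
    """Reconstruct hidden states while scaling up the top-k latents that
    best separate positive from negative samples (here: absolute
    difference of mean activations, an assumed criterion)."""
    z_pos, z_neg = sae_encode(h_pos), sae_encode(h_neg)
    disc = np.abs(z_pos.mean(0) - z_neg.mean(0))   # per-latent discriminativeness
    top_k = np.argsort(disc)[-k:]                  # most concept-relevant latents
    gain = np.ones(d_sae)
    gain[top_k] = scale                            # amplify only those latents
    return sae_decode(z_pos * gain), sae_decode(z_neg * gain)

# Toy usage: a difference-in-means concept vector from denoised representations.
h_pos = rng.standard_normal((100, d_model))        # placeholder "positive" activations
h_neg = rng.standard_normal((100, d_model))        # placeholder "negative" activations
den_pos, den_neg = denoise(h_pos, h_neg)
concept_vector = den_pos.mean(0) - den_neg.mean(0)
```

The same denoised representations could instead be fed to a linear probe, whose weight vector would then serve as the steering direction.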
Paper Type: Short
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: hierarchical & concept explanations, model editing, probing
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 974