Keywords: SAE, Sparse Autoencoder, Automated Interpretability, Test-time Feature, SAE Feature, Residual Stream, Bias Mitigation, Jailbreaking Prevention, Mechanistic Interpretability, AI Safety, AI Control
TL;DR: Existing steering approaches rely on contrastive datasets restricted to static contexts. CorrSteer instead leverages test-time activations directly, extending SAE-based steering and achieving practical gains across benchmarks.
Abstract: Sparse Autoencoders (SAEs) can extract interpretable features from large language models (LLMs) without supervision. However, existing SAE-based steering methods rely on contrastive activation differences or require large activation storage. To address these limitations, we propose CorrSteer, which extends SAE-based steering by directly leveraging generation-time activations. Our method selects features by correlating sample correctness with SAE activations from generated tokens, extracting task-relevant features while reducing spurious correlations. Steering coefficients are obtained from positive-sample activations, automating the entire pipeline. Our method shows improved task performance on QA, bias mitigation, jailbreaking prevention, and reasoning benchmarks on Gemma-2 2B and LLaMA-3.1 8B, notably achieving a +3.3% improvement in MMLU performance with 4000 samples and a +27.2% improvement in HarmBench with only 108 samples. Selected features demonstrate semantically meaningful patterns aligned with each task's requirements, revealing the underlying capabilities that drive performance. Our work establishes correlation-based selection as an effective and scalable approach for automated SAE steering across language model applications.
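The selection step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' released code: function names, array shapes, and the top-k selection rule are assumptions. It correlates per-feature mean SAE activations (over generated tokens) with binary sample correctness, keeps the highest-|correlation| features, and takes steering coefficients as the mean activation over correct samples.

```python
# Hypothetical sketch of correlation-based SAE feature selection
# in the spirit of CorrSteer; names and shapes are assumptions.
import numpy as np

def select_features(acts, correct, k=2):
    """acts: (n_samples, n_features) mean SAE activations over generated tokens.
    correct: (n_samples,) binary correctness labels.
    Returns indices of the top-k features by |Pearson correlation| with
    correctness, and steering coefficients taken as each selected feature's
    mean activation over the correct (positive) samples."""
    acts = np.asarray(acts, dtype=float)
    correct = np.asarray(correct, dtype=float)
    a = acts - acts.mean(axis=0)          # center activations per feature
    c = correct - correct.mean()          # center labels
    denom = np.sqrt((a ** 2).sum(axis=0) * (c ** 2).sum()) + 1e-9
    corr = (a * c[:, None]).sum(axis=0) / denom   # Pearson r per feature
    top = np.argsort(-np.abs(corr))[:k]
    coeffs = acts[correct == 1][:, top].mean(axis=0)
    return top, coeffs

# Synthetic check: feature 3 is made to track correctness.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)
X = rng.random((200, 16))
X[:, 3] += 2.0 * labels
idx, coef = select_features(X, labels, k=1)
print(idx[0])
```

On this synthetic data the planted feature (index 3) dominates the correlation ranking, and its coefficient is its mean activation among correct samples.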
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 10282