Abstract: Precise control over language model generation is vital for ensuring both safety and reliability. While prompt engineering and steering are commonly used to influence model behaviors, the vast number of parameters in large language models (LLMs) often results in highly intertwined internal representations. This interdependency can limit control precision and sometimes lead to unintended side effects. Recent research has explored the use of sparse autoencoders (SAEs) to disentangle knowledge in high-dimensional spaces for steering. However, these applications have been limited to toy tasks because locating ``atomic knowledge components'' is nontrivial. In this paper, we propose \textbf{Steering Target Atoms (STA)}, a novel method that isolates and manipulates disentangled knowledge components to enhance safety and align personality traits in LLMs. Comprehensive experiments demonstrate the effectiveness of our approach: steering with STA exhibits superior robustness and flexibility, particularly in adversarial scenarios. We also apply STA to o1-like models, confirming that it enables precise control over reasoning behavior. Our findings highlight the potential of steering through disentangled representations to achieve reliable and precise control over language model behaviors.
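To make the underlying idea concrete, the following is a minimal sketch of generic SAE-based activation steering, not the paper's exact STA procedure: a hidden state is encoded into sparse latent "atoms", selected atoms are amplified, and the reconstruction delta is folded back into the hidden state. All dimensions, the `atom_ids` indices, and the steering coefficient `coeff` are hypothetical placeholders, and the toy SAE weights stand in for a trained autoencoder.

```python
# Illustrative sketch only (assumptions labeled below); not the authors' STA method.
import torch

d_model, d_sae = 64, 512          # hidden size and SAE dictionary size (hypothetical)
torch.manual_seed(0)

# Toy sparse autoencoder weights standing in for a trained SAE.
W_enc = torch.randn(d_sae, d_model) * 0.02
b_enc = torch.zeros(d_sae)
W_dec = torch.randn(d_model, d_sae) * 0.02
b_dec = torch.zeros(d_model)

def sae_encode(h):
    """Map a residual-stream activation to sparse latent 'atoms'."""
    return torch.relu(h @ W_enc.T + b_enc)

def sae_decode(z):
    """Reconstruct the activation from the latent atoms."""
    return z @ W_dec.T + b_dec

def steer(h, atom_ids, coeff=4.0):
    """Amplify selected latent atoms and add only the reconstruction
    delta back to h, so the SAE's reconstruction error is left intact."""
    z = sae_encode(h)
    z_steered = z.clone()
    z_steered[..., atom_ids] = z_steered[..., atom_ids] + coeff
    return h + (sae_decode(z_steered) - sae_decode(z))

h = torch.randn(d_model)              # a single token's hidden state
h_new = steer(h, atom_ids=[3, 17])    # indices of hypothetical target atoms
print(h_new.shape)
```

In an actual pipeline this intervention would be applied to residual-stream activations at a chosen layer during generation; the sketch simply shows how selecting and scaling a few disentangled latent components yields a targeted edit rather than a dense direction applied to the whole hidden state.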
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: feature attribution, knowledge tracing/discovering/inducing, probing
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 3695