Abstract: Precise control over language model generation is vital for ensuring both safety and reliability. While prompt engineering and steering are commonly used to influence model behaviors, the vast number of parameters in large language models (LLMs) often results in highly intertwined internal representations. This interdependency can limit control precision and sometimes lead to unintended side effects. Recent research has explored the use of sparse autoencoders (SAEs) to disentangle knowledge in high-dimensional spaces for steering. However, these applications have been limited to toy tasks because locating ``atomic knowledge components'' is nontrivial. In this paper, we propose \textbf{Steering Target Atoms (STA)}, a novel method that isolates and manipulates disentangled knowledge components to enhance safety and align personality traits in LLMs. Comprehensive experiments demonstrate the effectiveness of our approach: steering with STA exhibits superior robustness and flexibility, particularly in adversarial scenarios. We also apply STA to o1-like models, confirming that it enables precise control over reasoning behavior. Our findings highlight the potential of steering through disentangled representations to achieve reliable and precise control over language model behaviors.
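To make the underlying idea concrete, the following is a minimal sketch of generic SAE-based activation steering, not the paper's exact STA procedure: a hidden state is encoded into sparse latent "atoms", selected atoms are amplified, and the reconstruction delta is folded back into the hidden state. All dimensions, the `atom_ids` indices, and the steering coefficient `coeff` are hypothetical placeholders, and the toy SAE weights stand in for a trained autoencoder.

```python
# Illustrative sketch only (assumptions labeled below); not the authors' STA method.
import torch

d_model, d_sae = 64, 512          # hidden size and SAE dictionary size (hypothetical)
torch.manual_seed(0)

# Toy sparse autoencoder weights standing in for a trained SAE.
W_enc = torch.randn(d_sae, d_model) * 0.02
b_enc = torch.zeros(d_sae)
W_dec = torch.randn(d_model, d_sae) * 0.02
b_dec = torch.zeros(d_model)

def sae_encode(h):
    """Map a residual-stream activation to sparse latent 'atoms'."""
    return torch.relu(h @ W_enc.T + b_enc)

def sae_decode(z):
    """Reconstruct the activation from the latent atoms."""
    return z @ W_dec.T + b_dec

def steer(h, atom_ids, coeff=4.0):
    """Amplify selected latent atoms and add only the reconstruction
    delta back to h, so the SAE's reconstruction error is left intact."""
    z = sae_encode(h)
    z_steered = z.clone()
    z_steered[..., atom_ids] = z_steered[..., atom_ids] + coeff
    return h + (sae_decode(z_steered) - sae_decode(z))

h = torch.randn(d_model)              # a single token's hidden state
h_new = steer(h, atom_ids=[3, 17])    # indices of hypothetical target atoms
print(h_new.shape)
```

In an actual pipeline this intervention would be applied to residual-stream activations at a chosen layer during generation; the sketch simply shows how selecting and scaling a few disentangled latent components yields a targeted edit rather than a dense direction applied to the whole hidden state.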
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: feature attribution, knowledge tracing/discovering/inducing, probing
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 3695