Keywords: test-time methods, steering, LLMs, safety, instruction following, controllable generation
Abstract: Controlling the generation of large language models (LLMs) remains a central challenge for ensuring they are both reliable and adaptable. Two common inference-time intervention approaches are instruction prompting, which provides natural language guidance, and latent steering, which directly modifies the model's internal activations to guide its behavior. Recently, attention manipulation methods have emerged that can enforce arbitrary user-provided instructions, representing a promising third approach to behavioral control. However, these methods have yet to be systematically compared against established approaches on complex behavioral tasks. Furthermore, existing methods suffer from critical limitations, either requiring computationally expensive head selection or, as we show, risking degraded generation quality from over-focusing on instructions. To address the evaluation gap, we establish a unified benchmark comparing these intervention approaches across 15 diverse behavioral control tasks. To address the technical limitations, we introduce Instruction Attention Boosting (InstABoost), a simple and efficient method that multiplicatively boosts attention to instruction tokens, avoiding the trade-offs of prior work. On our benchmark, InstABoost consistently outperforms or is competitive with all baselines, establishing attention manipulation as a robust method for behavioral control that preserves generation quality.
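The abstract describes InstABoost only at a high level (multiplicatively boosting attention to instruction tokens). The sketch below is an illustrative assumption of how such an intervention could look in PyTorch, not the paper's reference implementation; the boost factor `alpha`, the `instruction_mask` argument, and the point at which the hook is applied are all hypothetical.

```python
import torch

def boost_instruction_attention(attn_weights: torch.Tensor,
                                instruction_mask: torch.Tensor,
                                alpha: float = 2.0) -> torch.Tensor:
    """Multiply attention mass on instruction tokens by `alpha`, then renormalize.

    attn_weights:     (batch, heads, query_len, key_len) post-softmax attention.
    instruction_mask: (batch, key_len) boolean mask, True at instruction-token positions.
    alpha:            multiplicative boost applied to instruction tokens (hypothetical value).
    """
    # Per-key scale: alpha at instruction tokens, 1.0 elsewhere.
    scale = torch.ones(instruction_mask.shape, dtype=attn_weights.dtype,
                       device=attn_weights.device)
    scale = scale.masked_fill(instruction_mask, alpha)

    # Broadcast over heads and query positions, then renormalize each row.
    boosted = attn_weights * scale[:, None, None, :]
    return boosted / boosted.sum(dim=-1, keepdim=True)
```

In practice, a transformation like this could be attached as a forward hook on each attention module after the softmax, so that every query redistributes a larger share of its attention toward the instruction span without any per-head selection.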
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 22587