Keywords: activation steering, model conditioning, evaluation
TL;DR: A systematic evaluation of LLM conditioning methods reveals that no single method delivers satisfactory results in terms of both steering success and fluency across different model types and tasks.
Abstract: Controlling the output of Large Language Models (LLMs) is a central challenge for their safe and reliable deployment, yet a clear understanding of the trade-offs involved remains elusive. Current approaches to conditioning generation, ranging from expensive fine-tuning to lightweight activation steering and basic prompting, are often evaluated with a narrow focus on their effectiveness at injecting or removing a target concept, neglecting critical side effects on generation quality. This paper presents a systematic investigation of these methods in both injection and removal scenarios, introducing a comprehensive evaluation framework that assesses generation quality and moves beyond unreliable measures such as perplexity. Our analysis reveals that perplexity is a fundamentally brittle proxy for fluency, often rewarding repetitive text while penalizing well-formed outputs.
Using more robust metrics, we find that there is no "free lunch" in conditioning: lightweight methods frequently achieve conditioning at a steep cost to expressiveness. Furthermore, we identify a critical yet previously overlooked interaction with the training paradigm: activation steering methods are far less effective on instruction-tuned models than on their base counterparts.
While supervised fine-tuning emerges as the most robust method, it also exhibits significant side effects, such as the collateral learning of potentially undesired linguistic characteristics of the training set. Finally, simple prompting can be an alternative to more sophisticated conditioning methods for basic concept injection, but it fails to scale to tasks requiring tighter output control, such as concept removal. Collectively, our findings challenge common assumptions in the field, providing a more realistic characterization of the conditioning landscape and a simple yet principled methodology for future evaluation.
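To illustrate the abstract's claim that perplexity can reward degenerate, repetitive text, here is a minimal sketch (not the paper's evaluation code) that scores two strings with a causal LM; the model name ("gpt2") and the example strings are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch: perplexity as exp(mean token-level negative log-likelihood).
# Highly repetitive text can receive a lower (seemingly "better") perplexity
# than a well-formed sentence, which is the brittleness the abstract describes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any small causal LM suffices for the demo
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def perplexity(text: str) -> float:
    """Return exp of the mean cross-entropy over the tokens of `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=ids makes the model return the mean next-token loss.
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

repetitive = "the cat the cat the cat the cat the cat the cat the cat"
fluent = "The committee postponed its decision until the next quarterly meeting."

print(f"repetitive: {perplexity(repetitive):.1f}")
print(f"fluent:     {perplexity(fluent):.1f}")
```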
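For readers unfamiliar with the lightweight conditioning family the abstract refers to, the sketch below shows one common form of activation steering: adding a scaled direction vector to a hidden layer's output via a forward hook. This is a generic illustration under stated assumptions, not the paper's method; the model ("gpt2"), layer index, scaling coefficient, and the random placeholder vector are all hypothetical (in practice the vector is typically derived from model activations, e.g. a difference of means between concept and non-concept prompts).

```python
# Minimal activation-steering sketch: perturb one transformer block's output
# with a fixed direction during generation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"          # assumption: illustrative model choice
layer_idx = 6                # assumption: which block to steer
alpha = 8.0                  # assumption: steering strength

tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

# Placeholder steering direction; a real one would be computed from activations.
steer = torch.randn(model.config.n_embd)

def add_steering(module, inputs, output):
    # Block outputs may be a tuple (hidden_states, ...) or a plain tensor.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + alpha * (steer / steer.norm())
    if isinstance(output, tuple):
        return (hidden,) + output[1:]
    return hidden

handle = model.transformer.h[layer_idx].register_forward_hook(add_steering)
ids = tok("The weather today is", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=30, do_sample=False,
                     pad_token_id=tok.eos_token_id)
handle.remove()  # restore the unsteered model
print(tok.decode(out[0], skip_special_tokens=True))
```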
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 13052