Tradeoffs Between Alignment and Helpfulness in Language Models with Steering Methods

Published: 06 Mar 2025, Last Modified: 18 Mar 2025 · ICLR 2025 FM-Wild Workshop · CC BY 4.0
Keywords: Language model alignment, representation engineering, activation steering, feature steering
TL;DR: Tradeoffs between the alignment and helpfulness of language models arising from the use of steering methods.
Abstract: Language model alignment has become an important component of AI safety, enabling safe interactions between humans and language models by enhancing desired behaviors and inhibiting undesired ones. It is often achieved by tuning the model or by inserting preset aligning prompts. Recently, steering methods such as representation engineering, feature steering, and activation steering, which alter the model's behavior by changing its representations post-training, were shown to be effective in aligning LLMs. Steering methods yield gains on alignment-oriented tasks, such as resistance to adversarial attacks and reduction of social biases, but have also been shown to degrade the model's ability to perform basic tasks. In this paper we study the tradeoff between the increase in alignment and the decrease in helpfulness of the model. We propose a theoretical framework which provides bounds for these two quantities, and demonstrate their relevance empirically. First, we find that under the conditions of our framework, alignment can be guaranteed with steering methods, but that helpfulness is harmed in the process. Second, we show that helpfulness is harmed quadratically with the norm of the injected steering vectors, while alignment increases only linearly with it, indicating a regime in which it is efficient to use representation engineering. We validate our findings empirically, and chart the boundaries of the usefulness of these methods for alignment.
Submission Number: 93
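
Below is a minimal sketch of the kind of activation steering the abstract refers to: a steering vector is added to a model's hidden representations at inference time, with its norm controlling the strength of the intervention. It assumes a HuggingFace-style GPT-2 model; the layer index, steering vector, and scale are hypothetical illustration choices, not values or methods taken from the paper.

    # Minimal activation-steering sketch (illustrative; not the paper's code).
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # hypothetical stand-in model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    hidden_size = model.config.hidden_size
    # Placeholder direction; in practice this would be derived from data
    # (e.g., contrasting activations on desired vs. undesired behavior).
    steering_vector = torch.randn(hidden_size)
    steering_vector = steering_vector / steering_vector.norm()

    def make_hook(vector, scale):
        # Add the scaled steering vector to the layer's output activations.
        def hook(module, inputs, output):
            hidden = output[0] if isinstance(output, tuple) else output
            hidden = hidden + scale * vector.to(hidden.dtype)
            if isinstance(output, tuple):
                return (hidden,) + output[1:]
            return hidden
        return hook

    layer_idx = 6  # hypothetical injection layer
    scale = 4.0    # steering-vector norm; per the abstract, alignment gains grow
                   # roughly linearly in this norm while helpfulness degrades quadratically
    handle = model.transformer.h[layer_idx].register_forward_hook(
        make_hook(steering_vector, scale)
    )

    prompt = "How do I pick a strong password?"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=40)
    print(tokenizer.decode(out[0], skip_special_tokens=True))

    handle.remove()  # restore the unsteered model

Sweeping the scale parameter and measuring task performance at each value is one simple way to observe the linear-alignment versus quadratic-helpfulness regime the abstract describes.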