Keywords: steering, LLM, diffusion, control
Abstract: Steering the intermediate representations of generative models has recently emerged as a simple yet powerful approach for controlling aspects of generated text and images. However, despite the simplicity of the approach, no theoretical framework has yet been built around steering. In this paper, we aim to bridge this gap by building theory around concept steering. First, we provide a theoretical link between steering and the affine concept erasure framework, showing that a widely used steering setup for erasing unwanted behaviours or concepts from generative models is a special case of LEACE, a closed-form method for affine concept erasure in neural networks. Next, we consider the task of concept switching, which aims to replace information about an unwanted concept or behaviour in the model’s representations with another, more desired concept or behaviour. Here our contribution is two-fold: first, we formulate a theoretical framework for this task by adapting the existing affine concept erasure framework. Then, we identify weaknesses of the resulting framework and propose a new, improved one that we call MidSteer (MInimal Disturbance concept STEERing). Our results show that MidSteer performs favourably across a variety of tasks, modalities, and models, including image-generative diffusion models and LLMs.
Supplementary Material:  zip
Primary Area: generative models
Submission Number: 9248