Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller

Published: 24 Jun 2024, Last Modified: 31 Jul 2024ICML 2024 MI Workshop PosterEveryoneRevisionsBibTeXCC BY 4.0
Keywords: LLM Control; Controlled Generation; Mechanistic Interpretability
Abstract: We propose $SelfControl$, a novel method utilizing suffix gradients to control the behavior of large language models (LLMs) without explicit human annotations. Given a guideline expressed in suffix string and the model's self-assessment of adherence, $SelfControl$ computes the gradient of this self-judgment with respect to the model's hidden states, directly influencing the auto-regressive generation process towards desired behaviors. To enhance efficiency, we introduce $SelfControl_{Preifx}$, a compact module that encapsulates the learned representations from suffix gradients into a Prefix Controller, facilitating inference-time control for various LLM behaviors. Our experiments demonstrate $SelfControl$'s efficacy across multiple domains, including emotional modulation, ensuring harmlessness, and enhancing complex reasoning. Especially, $SelfControl_{Preifx}$ enables a plug-and-play control and jointly control multiple attributes, improving model outputs without altering model parameters or increasing inference-time costs.
Submission Number: 54
Loading