Activation Steering for LLM Alignment via a Unified ODE-Based Framework

ICLR 2026 Conference Submission9824 Authors

Published: 26 Jan 2026, Last Modified: 26 Jan 2026 · ICLR 2026 · CC BY 4.0
Keywords: LLM alignment, Representation Engineering, Activation Steering, ODE-based Framework, Barrier Functions
TL;DR: We propose a unified ODE-based framework for activation steering and introduce \textsc{Bodes}, a method derived from this framework that significantly outperforms current SOTA activation steering methods.
Abstract: Activation steering, or representation engineering, offers a lightweight approach to aligning large language models (LLMs) by manipulating their internal activations at inference time. However, current methods suffer from two key limitations: \textit{(i)} the lack of a unified theoretical framework to guide the design of steering directions, and \textit{(ii)} an over-reliance on \textit{one-step steering}, which fails to capture complex patterns in activation distributions. In this work, we propose a unified ordinary differential equation (ODE)-based \textit{theoretical} framework for activation steering in LLM alignment. We show that conventional activation addition can be interpreted as a first-order approximation to the solution of an ODE. From this ODE perspective, identifying a steering direction becomes equivalent to designing a \textit{barrier function} from control theory. Building on this framework, we introduce \textsc{Bodes} (\textbf{B}arrier function-guided \textbf{ODE} \textbf{S}teering), which delivers \textit{empirical} gains in LLM alignment. \textsc{Bodes} identifies steering directions by defining the barrier function as the log-density ratio between positive and negative activations, and uses it to construct an ODE for \textit{multi-step and adaptive} steering. Compared to state-of-the-art activation steering methods, \textsc{Bodes} achieves consistent empirical improvements on diverse LLM alignment benchmarks, including a notable 7\% improvement on TruthfulQA, 2\% on RealToxicityPrompts, and 2\% on UltraFeedback. Our work establishes a principled new view of activation steering in LLM alignment by unifying its theoretical foundations via ODEs and validating them empirically through the proposed \textsc{Bodes} method. We will release our source code upon publication.
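The core idea in the abstract can be illustrated with a small toy sketch (not the paper's implementation). Here the "positive" and "negative" activation distributions are modeled as Gaussians with hypothetical means and precisions; the barrier function is their log-density ratio, conventional activation addition is a single Euler step along its gradient, and multi-step steering integrates the same ODE dh/dt = ∇B(h) over several smaller steps:

```python
import numpy as np

# Hypothetical toy model: "positive" and "negative" activations are Gaussians.
# All means, precisions, and step sizes below are illustrative assumptions.
rng = np.random.default_rng(0)
d = 8
mu_pos, mu_neg = np.ones(d), -np.ones(d)
prec_pos, prec_neg = 2.0 * np.eye(d), 1.0 * np.eye(d)  # inverse covariances

def barrier(h):
    # B(h) = log p_pos(h) - log p_neg(h), up to an additive constant.
    return (-0.5 * (h - mu_pos) @ prec_pos @ (h - mu_pos)
            + 0.5 * (h - mu_neg) @ prec_neg @ (h - mu_neg))

def grad_barrier(h):
    # Gradient of the log-density ratio for the Gaussian toy model.
    return -prec_pos @ (h - mu_pos) + prec_neg @ (h - mu_neg)

def one_step_steer(h, alpha=0.5):
    # Conventional activation addition: one first-order (Euler) step of the ODE.
    return h + alpha * grad_barrier(h)

def multi_step_steer(h, total_time=0.5, n_steps=20):
    # Multi-step steering: Euler integration of dh/dt = grad B(h).
    dt = total_time / n_steps
    for _ in range(n_steps):
        h = h + dt * grad_barrier(h)
    return h

h0 = rng.normal(size=d)
print(barrier(one_step_steer(h0)) > barrier(h0))   # steering raises B
print(barrier(multi_step_steer(h0)) > barrier(h0))
```

In this toy, both variants increase the log-density ratio, i.e. move the activation toward the "positive" region; with a state-dependent gradient, the multi-step path follows the ODE trajectory more faithfully than a single large step, which is the abstract's motivation for multi-step, adaptive steering.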
Primary Area: foundation or frontier models, including LLMs
Submission Number: 9824