Dual Mechanisms of Value Expression: Decomposing Intrinsic and Prompted Values in Language Models

Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Venue: Mech Interp Workshop (NeurIPS 2025) Poster
License: CC BY 4.0
Keywords: Understanding high-level properties of models, Steering
TL;DR: We examine whether the feature directions for value expression differ depending on whether a system prompt is present, and whether each direction exerts an independent causal effect.
Abstract: While prompting is commonly used for assigning personas to LLMs, the fundamental question of how LLMs internally represent values remains unanswered. We observe that LLMs can express human values through two mechanisms: $\textit{intrinsic value expression}$ (inherent value-laden response patterns) and $\textit{prompted value expression}$ (value-laden response patterns following explicit instructions). We formalize these value expressions as feature directions in the model's residual stream and extract intrinsic and prompted value directions using the difference-in-means method. By comparing these directions, we investigate whether intrinsic and prompted value expressions rely on the same underlying mechanisms. Interventions using these directions show that both value directions can induce the model to express target values in its output. We find that even after removing the intrinsic value direction component from the prompted value direction, the remaining component can still steer the model's behavior. This suggests that while both directions produce similar outcomes, they use distinct neural mechanisms. Furthermore, we show that leveraging both intrinsic and prompted value directions is more effective for steering value expression than using either direction alone.
Submission Number: 260
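
The three operations named in the abstract (difference-in-means extraction, removing the intrinsic component from the prompted direction, and steering) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy 3-dimensional activations, the function names, the additive form of steering, and the coefficient `alpha` are all assumptions made for illustration. The abstract does not specify the layer, token position, contrast sets, or steering strength.

```python
# Toy sketch of the pipeline described in the abstract, in pure Python.
# All names and numbers are hypothetical; real activations would come
# from a model's residual stream.

def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def diff_in_means(pos_acts, neg_acts):
    """Direction = mean(value-expressing acts) - mean(baseline acts)."""
    mu_pos = mean_vector(pos_acts)
    mu_neg = mean_vector(neg_acts)
    return [p - q for p, q in zip(mu_pos, mu_neg)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def remove_component(v, u):
    """Return v minus its orthogonal projection onto u."""
    scale = dot(v, u) / dot(u, u)
    return [a - scale * b for a, b in zip(v, u)]

def steer(hidden, direction, alpha):
    """Assumed additive steering: hidden + alpha * direction."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Made-up activations: without a system prompt (intrinsic) and with one (prompted).
intrinsic_pos = [[1.0, 0.0, 0.0], [3.0, 0.0, 0.0]]
intrinsic_neg = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
prompted_pos = [[2.0, 2.0, 0.0], [2.0, 4.0, 0.0]]
prompted_neg = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

d_intrinsic = diff_in_means(intrinsic_pos, intrinsic_neg)
d_prompted = diff_in_means(prompted_pos, prompted_neg)

# Part of the prompted direction that is orthogonal to the intrinsic one.
d_residual = remove_component(d_prompted, d_intrinsic)

# Steer a hypothetical hidden state with the residual component only.
steered = steer([0.5, 0.5, 0.5], d_residual, alpha=2.0)
```

In this toy setup `d_residual` is orthogonal to `d_intrinsic` by construction, so any behavioral effect from steering with it could not be attributed to the intrinsic direction. That is the logic behind the abstract's claim that the two directions act through distinct mechanisms.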