Keywords: Feature Geometry, Methods (probing, steering, causal interventions)
TL;DR: We identify a 2D valence–arousal subspace in LLM representations exhibiting circumplex emotion geometry, show it affords bidirectional multi-behavioral control from a single set of axes, and propose lexical mediation as a mechanistic account.
Abstract: We show that emotion vectors in LLMs are organized by a two-dimensional valence-arousal (VA) subspace exhibiting circular geometry. Using principal component decomposition and ridge regression, we recover meaningful VA axes underlying emotion steering vectors; projections onto these axes correlate with human affect ratings across 44,728 words. Steering along these axes produces monotonic control over the affective properties of generated text and affords bidirectional control over multiple downstream behaviors (refusal and sycophancy) from a single subspace. These effects replicate across Llama-3.1-8B, Qwen3-8B, and Qwen3-14B. We propose lexical mediation to explain why both these effects and prior emotionally framed controls work: refusal and compliance tokens occupy distinct VA regions, and VA steering directly modulates their emission probabilities.
Submission Number: 355
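Illustrative sketch (not the authors' code): the abstract describes recovering VA axes with a principal component decomposition followed by ridge regression, then steering along those axes. The NumPy example below shows one way such a pipeline could look. Everything specific here is an assumption made for illustration: the function names (`top2_pca`, `ridge_fit`, `va_axes`, `steer`), the regularization strength `lam`, the use of closed-form ridge, and the idea that steering is a plain additive shift of a hidden state. The abstract does not say how emotion vectors are extracted, which layers are used, or how the steering strength is chosen.

```python
import numpy as np

def top2_pca(vectors):
    """Return the mean and the top-2 principal directions (shape (2, d))."""
    mean = vectors.mean(axis=0)
    _, _, vt = np.linalg.svd(vectors - mean, full_matrices=False)
    return mean, vt[:2]

def ridge_fit(X, Y, lam=1.0):
    """Closed-form ridge regression: W = (X^T X + lam * I)^{-1} X^T Y."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ Y)

def va_axes(emotion_vecs, ratings, lam=1.0):
    """Hypothetical helper: unit valence/arousal directions, shape (2, d).

    emotion_vecs: (n, d) emotion steering vectors (assumed given).
    ratings:      (n, 2) human (valence, arousal) ratings for the same items.
    """
    mean, basis = top2_pca(emotion_vecs)
    coords = (emotion_vecs - mean) @ basis.T        # (n, 2) PCA coordinates
    centered = ratings - ratings.mean(axis=0)
    W = ridge_fit(coords, centered, lam)            # (2, 2): coords -> ratings
    axes = W.T @ basis                              # map back to model space
    return axes / np.linalg.norm(axes, axis=1, keepdims=True)

def steer(hidden, axis, alpha):
    """Assumed steering rule: shift a hidden state by alpha along a unit axis."""
    return hidden + alpha * axis

# Synthetic check on made-up data (not from the paper): emotions placed on a
# circle in a planted 2D plane of a 64-dimensional space, plus small noise.
rng = np.random.default_rng(0)
d, n = 64, 40
q, _ = np.linalg.qr(rng.normal(size=(d, 2)))
true_val, true_aro = q[:, 0], q[:, 1]
theta = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
ratings = np.stack([np.cos(theta), np.sin(theta)], axis=1)
emotion_vecs = ratings @ np.stack([true_val, true_aro])
emotion_vecs = emotion_vecs + 0.01 * rng.normal(size=(n, d))

axes = va_axes(emotion_vecs, ratings)
hidden = rng.normal(size=d)
shifts = [steer(hidden, axes[0], a) @ axes[0] for a in (-2.0, 0.0, 2.0)]
```

In this toy setup the recovered axes can be compared against the planted directions by cosine similarity, and the projection of a steered hidden state onto the valence axis grows with `alpha`, which mirrors the monotonic control the abstract reports only in a schematic sense.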