Where Do Models Find Happiness? Emotion Vectors in Open-Source LLMs

Sinie van der Ben; Raphaël Baur; Yannick Metz; Mennatallah El-Assady

Published: 11 Jun 2026, Last Modified: 11 Jun 2026. Venue: Mech Interp Workshop ICML 2026 Virtual (poster). License: CC BY 4.0

Keywords: Interpretability for Knowledge Discovery, Other, Methods (probing, steering, causal interventions)

Other Keywords: Reproducibility; open-source models; emotion vectors

TL;DR: We replicate the emotion-vector findings from Claude Sonnet 4.5 in two smaller open-weight models, finding that valence structure generalises across architectures but emerges differently across model depth.

Abstract: Recent work identified ``emotion vectors'' in Claude Sonnet 4.5: internal representations that encode emotion concepts, causally influence behavior, and exhibit geometry mirroring human psychological structure. We test the generality of these findings in two open-weight models, Apertus-8B and Gemma-4-E4B, extracting emotion contrast vectors at every layer from two model-generated corpora. We recover valence geometry in both models, with peak PC1--valence correlations of $r = 0.76$ and $r = 0.83$, comparable to the $r = 0.81$ reported for Claude. Beyond replication, we observe notable differences in how valence representations emerge across model depth. In Gemma-4-E4B, valence is strongly encoded in early layers but collapses in later layers, whereas Apertus-8B shows the opposite pattern: valence representations are absent in early layers and emerge at mid depth. Arousal encoding, in contrast, is sensitive to the extraction corpus: both models show stronger PC2--arousal alignment with Gemma-generated stories ($r$ up to $0.45$) than with Apertus-generated ones ($r \leq 0.21$), suggesting that arousal-relevant cues are unevenly distributed across generated corpora. We open-source our experiment code and dataset for reproducible investigation of emotion representations across language model architectures.
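To make the analysis described in the abstract concrete, the following is a minimal sketch of one plausible pipeline for a single layer: compute per-emotion contrast vectors, project them onto principal components, and correlate PC1 with valence ratings. It is not the authors' released code. The contrast definition (per-emotion mean activation minus the mean over emotions), the helper names `contrast_vectors`, `pc_scores`, and `pearson_r`, and the synthetic activations and valence ratings are all assumptions made for illustration.

```python
import numpy as np


def contrast_vectors(acts, labels, n_emotions):
    """Per-emotion mean activation minus the mean over emotions (assumed definition)."""
    means = np.stack([acts[labels == e].mean(axis=0) for e in range(n_emotions)])
    return means - means.mean(axis=0, keepdims=True)


def pc_scores(vectors, k=2):
    """Project the vectors onto their top-k principal components via SVD."""
    centered = vectors - vectors.mean(axis=0, keepdims=True)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :k] * s[:k]


def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


# Synthetic stand-in for one layer's activations (hypothetical data, not from the paper).
rng = np.random.default_rng(0)
n_emotions, per_emotion, d_model = 12, 20, 64
valence = np.linspace(-1.0, 1.0, n_emotions)      # hypothetical valence ratings
valence_dir = rng.normal(size=d_model)
valence_dir /= np.linalg.norm(valence_dir)        # planted unit "valence direction"
labels = np.repeat(np.arange(n_emotions), per_emotion)
noise = 0.5 * rng.normal(size=(labels.size, d_model))
acts = 3.0 * valence[labels, None] * valence_dir + noise

vecs = contrast_vectors(acts, labels, n_emotions)
scores = pc_scores(vecs, k=2)
# The sign of a principal component is arbitrary, so compare magnitudes only.
r_pc1_valence = abs(pearson_r(scores[:, 0], valence))
print(f"|r(PC1, valence)| = {r_pc1_valence:.2f}")
```

Under these assumptions, repeating the same computation on activations from each layer would yield a depth profile of PC1--valence correlation, and replacing `scores[:, 0]` with `scores[:, 1]` and the valence ratings with arousal ratings would give the analogous PC2--arousal measure.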

Submission Number: 160
