Keywords: Interpretability for Knowledge Discovery, Other, Methods (probing, steering, causal interventions)
Other Keywords: Reproducibility; open-source models; emotion vectors
TL;DR: We replicate the emotion-vector findings from Claude Sonnet 4.5 in two smaller open-weight models, finding that valence structure generalises across architectures but emerges differently across model depth.
Abstract: Recent work identified ``emotion vectors'' in Claude Sonnet 4.5, which are internal representations that encode emotion concepts, causally influence behavior, and exhibit geometry mirroring human psychological structure.
We test the generality of these findings in two open-weight models, Apertus-8B and Gemma-4-E4B, extracting emotion contrast vectors across all layers from two model-generated corpora. We recover valence geometry in both models, with peak PC1--valence correlations of $r = 0.76$ and $r = 0.83$, comparable to the $r = 0.81$ reported for Claude.
Beyond replication, we observe notable differences in how valence representations emerge across model depth. In Gemma-4-E4B, valence is strongly encoded in early layers but collapses in later layers, whereas Apertus-8B shows the opposite pattern: valence representations are absent in early layers and emerge at mid depths.
Arousal encoding, in contrast, is sensitive to the extraction corpus: both models show stronger PC2--arousal alignment with Gemma-generated stories ($r$ up to $0.45$) than Apertus-generated ones ($r \leq 0.21$), suggesting arousal-relevant cues are unevenly distributed across generated corpora. We open-source our experiment code and dataset for reproducible investigation of emotion representations across language model architectures.
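The PC1--valence analysis summarised above can be illustrated with a minimal sketch. It runs on synthetic activations, not on model outputs, and every name in it (`contrast_vectors`, `pc_scores`, `pearson_r`, the planted `valence_dir`) is a hypothetical stand-in; it assumes that a contrast vector is the mean activation for one emotion minus the mean over all emotions, which may differ from the extraction procedure actually used.

```python
import numpy as np


def contrast_vectors(acts, labels, n_emotions):
    """Per-emotion mean activation minus the mean over all emotions (assumed definition)."""
    means = np.stack([acts[labels == e].mean(axis=0) for e in range(n_emotions)])
    return means - means.mean(axis=0, keepdims=True)


def pc_scores(vectors, k=2):
    """Project the vectors onto their top-k principal components via SVD."""
    centred = vectors - vectors.mean(axis=0, keepdims=True)
    u, s, _ = np.linalg.svd(centred, full_matrices=False)
    return u[:, :k] * s[:k]


def pearson_r(x, y):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


# Synthetic stand-in for one layer's activations: a planted valence
# direction plus isotropic Gaussian noise. All sizes are illustrative.
rng = np.random.default_rng(0)
n_emotions, per_emotion, d_model = 12, 40, 64
valence = np.linspace(-1.0, 1.0, n_emotions)  # hypothetical valence ratings
valence_dir = rng.normal(size=d_model)
valence_dir /= np.linalg.norm(valence_dir)

labels = np.repeat(np.arange(n_emotions), per_emotion)
noise = rng.normal(size=(labels.size, d_model))
acts = 3.0 * valence[labels, None] * valence_dir + noise

vecs = contrast_vectors(acts, labels, n_emotions)
scores = pc_scores(vecs, k=2)

# The sign of a principal component is arbitrary, so compare |r|.
r_valence = pearson_r(scores[:, 0], valence)
print(f"|r(PC1, valence)| = {abs(r_valence):.2f}")
```

Repeating the same computation on activations from each layer in turn would give a depth profile of the kind described above; the PC2--arousal comparison follows the same pattern with `scores[:, 1]` and a set of arousal ratings.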
Submission Number: 160