Keywords: political bias, LLMs, activation steering, activation patching, bias auditing
TL;DR: We audit emotion-vector steering in open-weight LLMs on voter-advice questionnaires, finding valence-driven shifts in concrete policy answers while party rankings stay anchored and moral-reasoning vectors show little political transfer.
Abstract: Large language models (LLMs) contain internal directions in their residual streams corresponding to discrete emotions and political stance. These directions encode internal emotional states and express measurable political preferences. We ask whether internal emotion steering shifts the political content a deployed model would produce, and whether analogous directions extracted for deontological and consequentialist moral reasoning exert comparable effects in a deployment-realistic setting. Our headline experiment evaluates emotion-steered LLMs on the German voter-advice questionnaire Wahl-O-Mat. On the one model where the diagnostic clears the emotion vectors and answer variance is non-trivial (Mistral-7B-Instruct-v0.3; Qwen-2.5-7B-Instruct locks on $89\%$ of Wahl-O-Mat theses, Gemma-2-9B-IT on $95\%$), positive-valence emotions push individual policy answers toward left-leaning positions and negative-valence emotions push them toward right-leaning positions, while party-rank outputs remain anchored across all $84$ steered conditions; the directional bias thus surfaces at the level of concrete policy questions rather than partisan identity. We extend the same extraction protocol to a deontology--consequentialism contrast; on Mistral-7B the resulting vector validates on the on-target moral-reasoning probe with the predicted sign (per-scenario slope $=-0.50$, $p=0.011$, $r(\alpha,\text{stance})=-0.97$) but does not transfer to either political instrument, while on Gemma-2-9B-IT the same vector is null on the probe. Taken together, these findings argue that representation-steering safety audits must distinguish concrete policy stance from partisan identity and synthetic political-compass scores from deployment-realistic political behaviour, and must report direction of effect alongside dose-response shape rather than single-strength magnitudes alone.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 478