Behavioural Asymmetry Across Activation Interventions for Big Five Personality Control in LLMs

Published: 14 Jun 2026, Last Modified: 21 Jun 2026
Venue: ICML 2026 Workshop MusIML Poster
License: CC BY 4.0
Keywords: persona steering, mechanistic interpretability, alignment
Abstract: Recent work in mechanistic interpretability and alignment has explored the steerability and localization of persona traits, usually focusing on alignment-relevant traits such as 'evil' or on task generalization in other domains, rather than on varied personality expression. In this work, we systematically compare different interventions for steering Big Five personality traits in small instruction-tuned Llama models, in both high- and low-trait directions. Our results show that activation addition is the only method that reliably both amplifies and suppresses trait expression across all five traits. Probe steering and directional ablation achieve partial control at best, failing to produce consistent bidirectional effects. Steerability also varies substantially across traits: Agreeableness and Extraversion respond most reliably to intervention, while Neuroticism resists steering in both directions. These results show that activation-based interventions are not interchangeable, even when they share the same geometric direction.
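To illustrate why two interventions that share one direction can behave differently, the sketch below contrasts activation addition with directional ablation on a toy hidden state. It assumes the commonly used definitions (addition shifts the activation by a scaled unit direction; ablation removes the activation's component along that direction); the abstract does not spell out the paper's exact formulation, and probe steering is left out for that reason. The function names, the vector `trait_direction`, and the coefficient `alpha` are hypothetical and chosen only for illustration.

```python
import math

def unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def activation_addition(h, direction, alpha):
    """Shift h by alpha along the unit direction (assumed definition).

    A positive alpha pushes toward the high-trait side, a negative alpha
    toward the low-trait side, so the method is bidirectional by design.
    """
    d = unit(direction)
    return [hi + alpha * di for hi, di in zip(h, d)]

def directional_ablation(h, direction):
    """Remove the component of h along the unit direction (assumed definition).

    The projection onto the direction becomes zero regardless of its
    starting value, so there is no knob for pushing past zero either way.
    """
    d = unit(direction)
    proj = dot(h, d)
    return [hi - proj * di for hi, di in zip(h, d)]

# Toy hidden state and a hypothetical trait direction (not from the paper).
h = [1.0, 2.0, 2.0]
trait_direction = [0.0, 3.0, 4.0]
d = unit(trait_direction)

amplified = activation_addition(h, trait_direction, alpha=4.0)
suppressed = activation_addition(h, trait_direction, alpha=-4.0)
ablated = directional_ablation(h, trait_direction)

# Projection of each result onto the trait direction.
print("original  ", dot(h, d))
print("amplified ", dot(amplified, d))
print("suppressed", dot(suppressed, d))
print("ablated   ", dot(ablated, d))
```

In this toy setting, addition moves the projection onto the trait direction up or down by `alpha`, while ablation always sets it to zero. That is one geometric reason the two operations need not produce the same behavioural effect, although it does not by itself explain the empirical results reported in the abstract.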
Track: Track 2: ML Research by Muslim Authors
Non Archival Confirmation: I understand that submissions to MusIML are non-archival and can be submitted to other venues.
Submission Number: 32