Behavioural Asymmetry Across Activation Interventions for Big Five Personality Control in LLMs

Hala Sheta

Published: 14 Jun 2026, Last Modified: 21 Jun 2026

Venue: ICML 2026 Workshop MusIML Poster

License: CC BY 4.0

Keywords: persona steering, mechanistic interpretability, alignment

Abstract: Recent work in mechanistic interpretability and alignment has explored the steerability and localization of persona traits, usually focusing on alignment-relevant traits such as 'evil' or on task generalization in other domains, rather than on varied personality expression. In this work, we systematically compare different interventions for steering Big Five personality traits in small instruction-tuned Llama models, in both high- and low-trait directions. Our results demonstrate that activation addition is the only method that reliably amplifies and suppresses trait expression across all five traits. Probe steering and directional ablation achieve partial control at best, failing to produce consistent bidirectional effects. Steerability also varies substantially across traits: Agreeableness and Extraversion respond most reliably to intervention, while Neuroticism resists steering in both directions. These results show that activation-based interventions are not interchangeable, even when they share the same geometric direction.
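To make the contrast between two of the interventions concrete, the following is a minimal sketch using their commonly cited textbook forms, not the paper's actual implementation, which this page does not describe. It assumes a single hidden-state vector `h`, a trait direction `direction`, and a steering coefficient `alpha`; all names and values are hypothetical. Probe steering is omitted because its definition is not given here.

```python
import math

def unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def activation_addition(h, direction, alpha):
    """Assumed form: h' = h + alpha * d_hat.
    The sign of alpha selects the high-trait or low-trait direction."""
    d_hat = unit(direction)
    return [x + alpha * d for x, d in zip(h, d_hat)]

def directional_ablation(h, direction):
    """Assumed form: h' = h - (h . d_hat) * d_hat.
    Removes the component along the direction; there is no sign to choose."""
    d_hat = unit(direction)
    proj = dot(h, d_hat)
    return [x - proj * d for x, d in zip(h, d_hat)]

# Toy values, purely illustrative (not taken from any model).
h = [1.0, 2.0, 2.0]
direction = [0.0, 3.0, 4.0]

added = activation_addition(h, direction, alpha=4.0)
ablated = directional_ablation(h, direction)

d_hat = unit(direction)
print("projection before:        ", dot(h, d_hat))
print("projection after addition:", dot(added, d_hat))
print("projection after ablation:", dot(ablated, d_hat))
```

Under these assumed forms, both operations act along the same direction, but addition shifts the projection by a signed amount while ablation can only set it to zero, which is one way the two can differ in behaviour despite sharing a geometric direction.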

Track: Track 2: ML Research by Muslim Authors

Email Sharing: We authorize the sharing of all author emails with Program Chairs.

Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.

Non Archival Confirmation: I understand that submissions to MusIML are non-archival and can be submitted to other venues.

Submission Number: 32
