Extraversion or Introversion? Controlling The Personality of Your Large Language Models

ACL ARR 2025 May Submission 6145 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Large language models (LLMs) excel in text generation and comprehension and often exhibit diverse synthetic personalities. However, some LLMs display toxic or otherwise undesirable behaviors, posing risks to safe deployment. Existing prompt-based control methods often yield fragile personality steering that is vulnerable to adversarial attacks, whereas robust training-based approaches remain underexplored. To address these gaps, we constructed dedicated personality datasets and systematically investigated multiple control methods for influencing LLM personalities, including Continual Pre-Training (CPT), Supervised Fine-Tuning (SFT), Reinforcement Learning from Human Feedback (RLHF), and prompt-based inference techniques. Experimental results show that training-based methods achieve more stable and robust personality control, whereas prompt-based methods, although effective, remain susceptible to adversarial manipulation. Building on these findings, we introduce Prompt Induction post Supervised Fine-Tuning (PISF), a two-stage method that delivers superior effectiveness, robustness, and success rates in personality control. Extensive experiments validate PISF’s ability to enforce safe and consistent personality control, thereby advancing trustworthy AI applications.
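The abstract summarizes PISF as a two-stage recipe: supervised fine-tuning on personality-consistent data, followed by a personality-inducing prompt at inference. The minimal sketch below illustrates that two-stage idea only; the base model (`gpt2`), the toy extraversion examples, and the induction-prompt wording are all illustrative assumptions, not the authors' data or implementation.

```python
# Minimal sketch of the two-stage PISF idea from the abstract:
# stage 1 fine-tunes on personality-labelled text (SFT); stage 2 adds a
# personality-inducing prompt at inference. All data and prompt wording
# here are hypothetical stand-ins, not the paper's actual setup.
import torch
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any small causal LM works as a stand-in
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Stage 1: supervised fine-tuning on personality-consistent responses.
sft_examples = [  # hypothetical extraversion-flavoured training pairs
    "Q: How was the party? A: Amazing! I talked to everyone there!",
    "Q: Free weekend plans? A: Inviting all my friends over, of course!",
]
optimizer = AdamW(model.parameters(), lr=5e-5)
model.train()
for text in sft_examples:
    batch = tokenizer(text, return_tensors="pt")
    # Standard causal-LM objective: the labels are the inputs themselves.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Stage 2: prompt induction at inference reinforces the trained trait.
model.eval()
induction = "You are an outgoing, talkative extravert. "  # assumed wording
inputs = tokenizer(induction + "Q: How was the party? A:", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=30, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

The design point the sketch makes concrete is that the two stages compose: the induction prompt steers a model whose weights already encode the target trait, which is why the combination is more robust than prompting alone.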
Paper Type: Long
Research Area: Sentiment Analysis, Stylistic Analysis, and Argument Mining
Research Area Keywords: style generation, applications
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 6145