Personality as a Probe for LLM Evaluation: Method Trade-offs and Downstream Effects

Published: 24 Sept 2025, Last Modified: 24 Sept 2025 · NeurIPS 2025 LLM Evaluation Workshop Poster · CC BY 4.0
Keywords: personality manipulation, large language models, evaluation, interpretability, stability, bias, parameter-efficient fine-tuning, in-context learning, mechanistic steering
TL;DR: We propose a contrastive dataset and unified evaluation framework to study personality manipulation in LLMs, analyzing method trade-offs across capability, bias, and stability.
Abstract: Personality manipulation in large language models (LLMs) is increasingly applied in customer service and agentic scenarios, yet its mechanisms and trade-offs remain unclear. We present a systematic study of personality control using the Big Five traits, comparing in-context learning (ICL), parameter-efficient fine-tuning (PEFT), and mechanistic steering (MS). Our contributions are fourfold. First, we construct a contrastive dataset with balanced high/low trait responses, enabling effective steering vector computation and fair cross-method evaluation. Second, we introduce a unified evaluation framework based on within-run $\Delta$ analysis that disentangles reasoning capability, agent performance, and demographic bias across the MMLU, GAIA, and BBQ benchmarks. Third, we develop trait purification techniques to separate openness from conscientiousness, addressing representational overlap in trait encoding. Fourth, we propose a three-level stability framework that quantifies method-, trait-, and combination-level robustness, offering practical guidance under deployment constraints. Experiments on Gemma-2-2B-IT and LLaMA-3-8B-Instruct reveal clear trade-offs: ICL achieves strong alignment with minimal capability loss, PEFT delivers the highest alignment at the cost of degraded task performance, and MS provides lightweight runtime control with competitive effectiveness. Trait-level analysis shows that openness is uniquely challenging across methods and that personality encoding consolidates around intermediate layers. Taken together, these results provide a rigorous comparative analysis of how different adaptation techniques (surface-level prompting, parameter-efficient fine-tuning, and activation-level steering) affect model performance and behavior.
This work establishes a framework for assessing the trade-offs between behavioral alignment, capability degradation, and deployment efficiency, offering critical insights for practitioners navigating the LLM adaptation lifecycle.
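The paper does not include code here, but the steering-vector computation the abstract describes (deriving a direction from contrastive high/low trait responses) is commonly implemented as a difference of mean activations. The sketch below illustrates that standard approach; the function name, array shapes, and toy data are assumptions, not the authors' implementation.

```python
import numpy as np

def steering_vector(high_acts: np.ndarray, low_acts: np.ndarray) -> np.ndarray:
    """Difference-of-means steering vector from contrastive activations.

    high_acts, low_acts: shape (n_examples, hidden_dim), hidden-state
    activations at a chosen layer for high-trait and low-trait responses.
    Returns a unit-norm direction that can be scaled and added to the
    residual stream at inference time.
    """
    v = high_acts.mean(axis=0) - low_acts.mean(axis=0)
    return v / np.linalg.norm(v)  # normalize so a scalar coefficient sets strength

# Toy usage with synthetic activations (hidden_dim = 4 for illustration)
rng = np.random.default_rng(0)
high = rng.normal(loc=1.0, size=(8, 4))
low = rng.normal(loc=-1.0, size=(8, 4))
v = steering_vector(high, low)
print(v.shape)  # (4,)
```

At runtime, such a vector is typically added to an intermediate layer's activations scaled by a coefficient, which is what makes activation-level steering a lightweight alternative to fine-tuning.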
Submission Number: 167