White-Box Monitoring for Personality Mirroring in Conversational AI

Published: 28 Feb 2026, Last Modified: 04 Apr 2026 · CAO Poster · CC BY 4.0
Keywords: personality mirroring, activation monitoring, representation engineering, behavioral drift, LLM safety, AI Safety, mechanistic interpretability, Large Language Models
TL;DR: Activation projections onto trait-space PCs detect persona shifts in Gemma-2-27B-it: the model mirrors user personality without instruction (d=3.4-6.4 across 140 topics), validated by surface features and an LLM judge.
Abstract: Conversational AI assistants can shift their personality or tone depending on who they interact with and what they discuss, a concern raised by recent findings of identity drift and persistent personality instability in large language models. We demonstrate a white-box method for detecting such shifts by projecting model activations onto trait-space principal components derived from prior work on personality vectors. Applying this to Gemma-2-27B-it across 2,940 simulated conversations where users embody contrasting personality styles (e.g., ironic vs. diplomatic), we find that the model naturally mirrors user personality without explicit instruction, with consistent effect sizes (Cohen’s d = 3.4–6.4; 54–94% of conversation topics significant after FDR correction). When conversing with neutral users (no assigned personality), we find that different conversation domains also elicit distinct persona profiles: topic category explains 53–64% of variance in trait-space (e.g., Creative Writing and Politics occupy opposite ends of the agreeable–antagonistic axis). Both surface-level text features and an independent LLM-as-a-judge corroborate that the persona shifts detected in activations correspond to observable differences in model output. These findings suggest that activation-based monitoring could complement black-box behavioral observation for detecting personality drift in deployed large language models.
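The core pipeline the abstract describes — deriving a trait axis from contrasting activations, projecting onto it, and measuring separation with Cohen's d — can be sketched as follows. This is a minimal illustration on synthetic data, not the paper's implementation: the dimensionality, sample counts, and the `trait_shift` construction are all assumptions standing in for Gemma-2-27B-it hidden states collected under contrasting persona conditions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for hidden-state activations gathered under two
# contrasting persona conditions (e.g., ironic vs. diplomatic users).
# In the actual study these would be Gemma-2-27B-it residual-stream states.
d_model = 64
trait_shift = rng.normal(size=d_model)  # assumed latent persona direction
acts_ironic = rng.normal(size=(200, d_model)) + trait_shift
acts_diplomatic = rng.normal(size=(200, d_model)) - trait_shift

# The first principal component of the pooled, centered activations serves
# as the trait axis (analogous to a trait-space PC from personality vectors).
X = np.vstack([acts_ironic, acts_diplomatic])
X_centered = X - X.mean(axis=0)
_, _, vt = np.linalg.svd(X_centered, full_matrices=False)
pc1 = vt[0]

# Project each condition onto the trait axis and compute Cohen's d
# with a pooled standard deviation.
proj_a = acts_ironic @ pc1
proj_b = acts_diplomatic @ pc1
pooled_sd = np.sqrt((proj_a.var(ddof=1) + proj_b.var(ddof=1)) / 2)
cohens_d = abs(proj_a.mean() - proj_b.mean()) / pooled_sd
print(f"Cohen's d along trait axis: {cohens_d:.2f}")
```

On this toy data the two conditions separate cleanly along the first PC, giving a large effect size; the paper's reported d = 3.4–6.4 is the analogous statistic computed over real conversation activations per topic.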
Submission Number: 119