White-Box Monitoring for Personality Mirroring in Conversational AI

Published: 28 Feb 2026, Last Modified: 04 Apr 2026 · CAO Poster · CC BY 4.0
Keywords: personality mirroring, activation monitoring, representation engineering, behavioral drift, LLM safety, AI Safety, mechanistic interpretability, Large Language Models
TL;DR: Activation projections onto trait-space PCs detect persona shifts in Gemma-2-27B-it: the model mirrors user personality without instruction (d=3.4-6.4 across 140 topics), validated by surface features and an LLM judge.
Abstract: Conversational AI assistants can shift their personality or tone depending on who they interact with and what they discuss, a concern raised by recent findings of identity drift and persistent personality instability in large language models. We demonstrate a white-box method for detecting such shifts by projecting model activations onto trait-space principal components derived from prior work on personality vectors. Applying this to Gemma-2-27B-it across 2,940 simulated conversations where users embody contrasting personality styles (e.g., ironic vs. diplomatic), we find that the model naturally mirrors user personality without explicit instruction, with consistent effect sizes (Cohen’s d = 3.4–6.4; 54–94% of conversation topics significant after FDR correction). When conversing with neutral users (no assigned personality), we find that different conversation domains also elicit distinct persona profiles: topic category explains 53–64% of variance in trait-space (e.g., Creative Writing and Politics occupy opposite ends of the agreeable–antagonistic axis). Both surface-level text features and an independent LLM-as-a-judge corroborate that the persona shifts detected in activations correspond to observable differences in model output. These findings suggest that activation-based monitoring could complement black-box behavioral observation for detecting personality drift in deployed large language models.
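The core pipeline the abstract describes — deriving a trait axis from contrasting activations, projecting onto it, and measuring separation with Cohen's d — can be sketched as follows. This is a minimal illustration on synthetic data, not the paper's implementation: the dimensionality, sample counts, and the `trait_shift` construction are all assumptions standing in for Gemma-2-27B-it hidden states collected under contrasting persona conditions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for hidden-state activations gathered under two
# contrasting persona conditions (e.g., ironic vs. diplomatic users).
# In the actual study these would be Gemma-2-27B-it residual-stream states.
d_model = 64
trait_shift = rng.normal(size=d_model)  # assumed latent persona direction
acts_ironic = rng.normal(size=(200, d_model)) + trait_shift
acts_diplomatic = rng.normal(size=(200, d_model)) - trait_shift

# The first principal component of the pooled, centered activations serves
# as the trait axis (analogous to a trait-space PC from personality vectors).
X = np.vstack([acts_ironic, acts_diplomatic])
X_centered = X - X.mean(axis=0)
_, _, vt = np.linalg.svd(X_centered, full_matrices=False)
pc1 = vt[0]

# Project each condition onto the trait axis and compute Cohen's d
# with a pooled standard deviation.
proj_a = acts_ironic @ pc1
proj_b = acts_diplomatic @ pc1
pooled_sd = np.sqrt((proj_a.var(ddof=1) + proj_b.var(ddof=1)) / 2)
cohens_d = abs(proj_a.mean() - proj_b.mean()) / pooled_sd
print(f"Cohen's d along trait axis: {cohens_d:.2f}")
```

On this toy data the two conditions separate cleanly along the first PC, giving a large effect size; the paper's reported d = 3.4–6.4 is the analogous statistic computed over real conversation activations per topic.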
Submission Number: 119