Keywords: Interpretability for AI Safety, Feature Geometry, Applications of interpretability
Other Keywords: User Personas, Mechanistic Interpretability, AI Safety, Personalization, Refusal Behaviors, Representation Geometry, Language Models
TL;DR: User personas are encoded as coherent low-dimensional subspaces in language-model activations; these subspaces help predict and causally modulate refusal behavior.
Abstract: As language-model chatbots increasingly use persistent user information, safety-relevant behaviors may depend not only on what is asked, but also on who the model represents the user to be. Prior work has shown that LLMs modulate refusal behavior based on perceived user personas. However, most studies examine this effect only at the behavioral level, while mechanistic analyses typically represent user personas as linear directions in activation space. We characterize user personas in terms of Knowledge, Intent, Emotion, and Belief, and decompose each into contextually distinct subcategories to study user-representation geometry. We find that user personas are encoded as coherent low-dimensional subspaces in activation space, rather than collapsing into a single generic user direction. These representations are behaviorally meaningful: projections onto directions within these subspaces predict model refusal for individual prompts, and interventions along them shift the model's inferred user profile. These findings show that personalized context can modulate safety behavior through structured internal user representations, with implications for auditing memory-enabled LLM systems.
Submission Number: 487