From Traits to Circuits: Toward Mechanistic Interpretability of Personality in Large Language Models
Keywords: feature attribution, explanation faithfulness, probing, hierarchical & concept explanations
Abstract: Large language models (LLMs) have been observed to exhibit personality-like behaviors when prompted with standardized psychological assessments. However, existing approaches treat personality as a black-box property, relying solely on behavioral probing while offering limited insight into the internal mechanisms responsible for personality expression. In this work, we take a mechanistic interpretability perspective and investigate whether personality traits in LLMs correspond to identifiable internal computation paths. To this end, we construct TraitTrace, a dataset designed to elicit distinct personality traits and support structural tracing. Using this dataset, we identify personality circuits as minimal functional subgraphs within the model's computation graph that give rise to trait-specific responses. We then analyze the structural properties of these circuits across model layers and personality traits, and conduct causal interventions to probe the influence of individual components. Our findings offer a novel structural view of personality in LLMs, providing a bridge between behavioral psychology and mechanistic interpretability.
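The causal interventions described in the abstract are commonly realized via activation patching: overwriting one component's activation with its value from a counterfactual run and measuring the effect on the output. The following is a minimal, self-contained sketch of that technique on a toy network; it is an illustration under assumed names (TinyNet, trait_input, neutral_input), not the paper's actual method or code.

```python
# Illustrative sketch of activation patching, one standard causal-intervention
# technique in mechanistic interpretability. All names here are hypothetical.
import torch
import torch.nn as nn

torch.manual_seed(0)

class TinyNet(nn.Module):
    """Stand-in for an LLM: two layers whose hidden states we can patch."""
    def __init__(self, d=16):
        super().__init__()
        self.layer1 = nn.Linear(d, d)
        self.layer2 = nn.Linear(d, d)
        self.head = nn.Linear(d, 2)  # e.g. a trait-vs-neutral readout

    def forward(self, x):
        h1 = torch.relu(self.layer1(x))
        h2 = torch.relu(self.layer2(h1))
        return self.head(h2)

model = TinyNet()
trait_input = torch.randn(1, 16)    # prompt eliciting the trait (hypothetical)
neutral_input = torch.randn(1, 16)  # matched neutral prompt (hypothetical)

# 1) Cache the component's activation on the trait-eliciting input.
cache = {}
def save_hook(module, inputs, output):
    cache["h"] = output.detach()

handle = model.layer1.register_forward_hook(save_hook)
trait_logits = model(trait_input)
handle.remove()

# 2) Re-run on the neutral input, patching in the cached trait activation.
#    Returning a value from a forward hook replaces the module's output.
def patch_hook(module, inputs, output):
    return cache["h"]

handle = model.layer1.register_forward_hook(patch_hook)
patched_logits = model(neutral_input)
handle.remove()

baseline_logits = model(neutral_input)

# If the patch alone moves the output toward the trait response, this
# component is causally implicated in trait expression.
print("trait:   ", trait_logits)
print("baseline:", baseline_logits)
print("patched: ", patched_logits)
```

In a circuit-discovery setting, repeating this intervention over every candidate component and keeping only those whose patching meaningfully shifts the trait readout yields the minimal functional subgraph the abstract refers to as a personality circuit.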
Primary Area: interpretability and explainable AI
Submission Number: 11066