From Traits to Circuits: Toward Mechanistic Interpretability of Personality in Large Language Models

ICLR 2026 Conference Submission 11066 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: feature attribution, explanation faithfulness, probing, hierarchical & concept explanations
Abstract: Large language models (LLMs) have been observed to exhibit personality-like behaviors when prompted with standardized psychological assessments. However, existing approaches treat personality as a black-box property, relying solely on behavioral probing while offering limited insight into the internal mechanisms responsible for personality expression. In this work, we take a mechanistic interpretability perspective and investigate whether personality traits in LLMs correspond to identifiable internal computation paths. To this end, we construct \textsc{TraitTrace}, a dataset designed to elicit distinct personality traits and support structural tracing. Using this dataset, we identify personality circuits as minimal functional subgraphs within the model’s computation graph that give rise to trait-specific responses. We then analyze the structural properties of these circuits across model layers and personality traits, and conduct causal interventions to probe the influence of individual components. Our findings offer a novel structural view of personality in LLMs, providing a bridge between behavioral psychology and mechanistic interpretability.
Primary Area: interpretability and explainable AI
Submission Number: 11066