Keywords: Mechanistic Interpretability, Sparse Autoencoders
Abstract: In this work, we identify several intriguing internal mechanisms shared across diverse Large Language Models (LLMs) during prompt processing. These behaviors are not explicitly trained, yet they arise reliably across model families and scales and influence model behavior. Adopting a cognitively inspired perspective, we demonstrate that these patterns resemble established heuristics in human information processing, such as implicit structural segmentation, the formation of unconscious expectations, and the dynamic adaptation of internal resources under constraint.
Using sparse autoencoders (SAEs) and the decoder logit lens as analytical tools, we uncover multiple such phenomena, including (1) internal semantic parsing features that track document structure; (2) cross-exemplar interactions, where current representations are modulated by expectations induced by prior context; (3) role-adaptive features that exhibit functional plasticity by dynamically shifting their semantic profile based on contextual constraints; and (4) implicit expectations regarding the number of few-shot exemplars. We statistically validate these behaviors across multiple model architectures, suggesting that LLMs develop internal heuristics that, while not identical to human heuristics, exhibit striking structural similarities to patterns observed in human cognition.
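For readers unfamiliar with the two analytical tools named in the abstract, the following is a minimal illustrative sketch, not the paper's actual pipeline: all dimensions, weight tensors, and helper names are placeholders. It shows the general idea of encoding residual-stream activations with an SAE and projecting a feature's decoder direction through the unembedding matrix ("decoder logit lens") to read off the vocabulary tokens that feature promotes.

```python
import torch

# Placeholder dimensions; real values depend on the model and SAE checkpoint.
d_model, d_sae, vocab = 768, 16384, 50257

# Random stand-ins for a trained SAE and the model's unembedding matrix.
W_enc = torch.randn(d_model, d_sae) / d_model ** 0.5   # SAE encoder weights
b_enc = torch.zeros(d_sae)                              # SAE encoder bias
W_dec = torch.randn(d_sae, d_model) / d_sae ** 0.5      # SAE decoder (feature directions)
W_U   = torch.randn(d_model, vocab) / d_model ** 0.5    # unembedding matrix

def sae_features(resid: torch.Tensor) -> torch.Tensor:
    """Encode a residual-stream activation vector into sparse feature activations."""
    return torch.relu(resid @ W_enc + b_enc)

def decoder_logit_lens(feature_idx: int, top_k: int = 10) -> list[int]:
    """Project one feature's decoder direction through the unembedding to find
    the top-k vocabulary token ids it most strongly promotes."""
    logits = W_dec[feature_idx] @ W_U
    return logits.topk(top_k).indices.tolist()

# Usage: inspect which features fire at one token position (random placeholder).
resid = torch.randn(d_model)
acts = sae_features(resid)
top_features = acts.topk(5).indices.tolist()
print(top_features, decoder_logit_lens(top_features[0]))
```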
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: hierarchical & concept explanations, feature attribution
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 10762