Do Language Models Internalize Human-Like Stereotype Structures? Uncovering and Modulating Stereotype Utility Structure in LLMs
Abstract: While large language models (LLMs) are known to exhibit stereotyped outputs, it remains unclear whether such biases reflect a structured, human-like internal organization. Drawing on the Stereotype Content Model (SCM) from social psychology, we propose that LLMs internalize a low-dimensional stereotype utility space along the Warmth and Competence axes. We introduce a stereotype utility probing framework that combines pairwise contrastive prompting with Thurstonian modeling to infer latent group preferences across multiple LLMs. Our analysis shows that this utility structure robustly recapitulates canonical human stereotype patterns, is stable across models and prompts, and shifts predictably under political conditioning. By probing attention heads, we further localize the encoding of these social dimensions and show that targeted interventions can causally modulate the affective framing of model outputs. Our findings reveal that LLMs not only exhibit human-like stereotype structures but also encode them in functionally actionable internal representations, opening new avenues for diagnosing and mitigating social bias.
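To make the probing framework concrete, the sketch below shows one way pairwise preference counts elicited via contrastive prompts could be converted into latent group utilities under a Thurstone Case V model (unit noise variance, utilities identified up to a constant). The group names, toy counts, and fitting details are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumptions, not the paper's code): fit a Thurstone Case V
# model to pairwise preference counts collected from contrastive LLM prompts.
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

groups = ["group_A", "group_B", "group_C"]           # hypothetical social groups
# wins[i, j] = number of prompts in which the model preferred group i over group j
wins = np.array([[0, 8, 12],
                 [4, 0, 9],
                 [2, 5, 0]], dtype=float)             # toy counts, not real data

def neg_log_likelihood(mu):
    mu = mu - mu.mean()                               # fix the additive degree of freedom
    nll = 0.0
    for i in range(len(groups)):
        for j in range(len(groups)):
            if i == j:
                continue
            # Case V: P(i preferred over j) = Phi((mu_i - mu_j) / sqrt(2))
            p = norm.cdf((mu[i] - mu[j]) / np.sqrt(2.0))
            nll -= wins[i, j] * np.log(p + 1e-12)
    return nll

res = minimize(neg_log_likelihood, x0=np.zeros(len(groups)), method="L-BFGS-B")
utilities = res.x - res.x.mean()
for g, u in zip(groups, utilities):
    print(f"{g}: latent utility {u:+.3f}")
```

In the paper's framing, such utilities would be estimated separately along the Warmth and Competence axes and then compared against canonical SCM quadrants.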
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: model bias/fairness evaluation; ethical considerations in NLP applications; transparency
Contribution Types: Model analysis & interpretability
Languages Studied: English
Keywords: model bias/fairness evaluation; ethical considerations in NLP applications; transparency
Submission Number: 1115