Do Language Models Internalize Human-Like Stereotype Structures? Uncovering and Modulating Stereotype Utility Structure in LLMs
Abstract: While large language models (LLMs) are known to exhibit stereotyped outputs, it remains unclear whether such biases reflect a structured, human-like internal organization. Drawing on the Stereotype Content Model (SCM) from social psychology, we propose that LLMs internalize a low-dimensional stereotype utility space along the Warmth and Competence axes. We introduce a stereotype utility probing framework that combines pairwise contrastive prompting with Thurstonian modeling to infer latent group preferences across multiple LLMs. Our analysis shows that this utility structure robustly recapitulates canonical human stereotype patterns, is stable across models and prompts, and shifts predictably under political conditioning. By probing attention heads, we further localize the encoding of these social dimensions and show that targeted interventions can causally modulate the affective framing of model outputs. Our findings reveal that LLMs not only exhibit human-like stereotype structures but also encode them in functionally actionable internal representations, opening new avenues for diagnosing and mitigating social bias.
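To make the probing framework concrete, the sketch below shows one way pairwise preference counts elicited via contrastive prompts could be converted into latent group utilities under a Thurstone Case V model (unit noise variance, utilities identified up to a constant). The group names, toy counts, and fitting details are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumptions, not the paper's code): fit a Thurstone Case V
# model to pairwise preference counts collected from contrastive LLM prompts.
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

groups = ["group_A", "group_B", "group_C"]           # hypothetical social groups
# wins[i, j] = number of prompts in which the model preferred group i over group j
wins = np.array([[0, 8, 12],
                 [4, 0, 9],
                 [2, 5, 0]], dtype=float)             # toy counts, not real data

def neg_log_likelihood(mu):
    mu = mu - mu.mean()                               # fix the additive degree of freedom
    nll = 0.0
    for i in range(len(groups)):
        for j in range(len(groups)):
            if i == j:
                continue
            # Case V: P(i preferred over j) = Phi((mu_i - mu_j) / sqrt(2))
            p = norm.cdf((mu[i] - mu[j]) / np.sqrt(2.0))
            nll -= wins[i, j] * np.log(p + 1e-12)
    return nll

res = minimize(neg_log_likelihood, x0=np.zeros(len(groups)), method="L-BFGS-B")
utilities = res.x - res.x.mean()
for g, u in zip(groups, utilities):
    print(f"{g}: latent utility {u:+.3f}")
```

In the paper's framing, such utilities would be estimated separately along the Warmth and Competence axes and then compared against canonical SCM quadrants.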
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: model bias/fairness evaluation; ethical considerations in NLP applications; transparency
Contribution Types: Model analysis & interpretability
Languages Studied: English
Keywords: model bias/fairness evaluation; ethical considerations in NLP applications; transparency
Submission Number: 1115