Keywords: Large Language Models, Psychology, Cognitive Science, Fairness, Bias
Abstract: As large language models (LLMs) are adopted into frameworks that grant them capacities to make real decisions, the consequences of their social biases intensify. Yet, we argue that simply removing biases from models is not enough. Using a paradigm from the psychology literature, we demonstrate that LLMs can spontaneously develop novel social biases about artificial demographic groups even when no inherent differences exist. These biases lead to highly stratified task allocations, which are less fair than assignments made by human participants and are exacerbated in newer and larger models. Emergent biases like these have been shown in the social sciences to result from exploration-exploitation trade-offs, where the decision-maker explores too little, allowing early observations to strongly influence impressions of entire demographic groups. To alleviate this effect, we examine a series of interventions targeting system inputs, problem structure, and explicit steering. We find that explicitly incentivizing exploration most robustly reduces stratification, highlighting the need to better incorporate multifaceted objectives to mitigate bias. These results reveal that LLMs are not merely passive mirrors of human social bias, but can actively create new ones from experience, raising urgent questions about how these systems will shape societies over time.
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 13596