User Confidence-Fueled Stereotypes: Investigating Sycophantic Amplification of Implicit Bias in Language Models

Published: 27 Oct 2025, Last Modified: 27 Oct 2025NeurIPS Lock-LLM Workshop 2025 PosterEveryoneRevisionsBibTeXCC BY 4.0
Keywords: Implicit Association Test (IAT), LLM bias, implicit bias, stereotype, bias measurement, fairness, algorithmic bias, social bias, sycophancy
TL;DR: This paper investigates how implicit bias is affected by user confidence in large language models using an IAT-inspired prompt framework to evaluate associations between Default and Stigma groups with positive and negative attributes.
Abstract: Large Language Models (LLMs) may seem explicitly unbiased on the surface, yet they still harbor under-the-radar implicit biases that are harder to see with the naked eye. In trying to seem unbiased, LLMs may also try too hard to align their responses with their user's values or beliefs, even when they may be misleading. We evaluate the effect of these two indiscriminate problems and the relationship between them. For LLM Implicit Bias, we use the the widely-known Implicit Association Test, which has previously been used to evaluate implicit biases in humans and adapted for LLMs. We then strain these implicit connections the model makes by applying confidence towards a certain association, seeing which whether or not the model may reduce or amplify it's bias in order to match our values. Using these measures, we found that when the model harbors a clear bias (denoted by a relatively extreme IAT Bias score) in either the positive or negative direction, the addition of user confidence will cause the confidence to "flip" in the other direction. Our iterations of user confidence completely supersede the model's internal biases, often able to take a common stereotype in LLMs and completely flip it on its head. Despite this, eliminating the bias entirely has proven to be a difficult task, as sycophancy brings extreme volatility to the table.
Submission Number: 69
Loading