Keywords: NLP, LLM, Bias
Abstract: Current evaluation paradigms for Large Language Model (LLM) bias predominantly rely on static evaluation frameworks, which measure model bias levels through pre-constructed benchmark datasets. However, these methods struggle to capture emergent bias—biases that surface dynamically during intense, multi-turn adversarial dialogues and complex interaction scenarios. This paper proposes EMBER, a framework for the evaluation and mitigation of emergent bias in multi-turn adversarial dialogues, implementing a multi-agent system to systematically address these challenges.
Experimental results reveal two core conclusions: (1) Multi-turn adversarial dialogues significantly amplify emergent bias in LLMs, and bias evolution exhibits "dynamic periodicity," with distinct bias response patterns across different models; (2) Traditional initial-injection prompt mitigation strategies are effective only in the initial stages, before strong adversarial stimulation; under sustained viewpoint shocks, their mitigation effect decays rapidly and may even trigger a "defensive bias reinforcement" phenomenon. These results highlight the complexity of bias mitigation in adversarial scenarios and offer key insights for optimizing subsequent mitigation strategies.
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: model bias/fairness evaluation, model bias/unfairness mitigation
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 8200