Prior Beliefs Prejudice LLM-as-Judge: Evidence from Persuasion Evaluation
Keywords: LLM-as-judge, persuasion evaluation, alignment bias, prior prejudice, bias detection, persuasion-as-probe, content moderation, truth-rhetoric conflation
TL;DR: LLMs systematically conflate their alignment-instilled beliefs with rhetorical persuasiveness; we exploit this conflation to reveal hidden biases via persuasion-as-probe.
Abstract: Large Language Models (LLMs) are increasingly used as judges to evaluate text quality, moderate content, and assess arguments. We investigate whether the prior beliefs instilled through alignment training influence LLM judgments when the models serve as evaluators. We select persuasion evaluation as a representative task and test whether LLMs can objectively assess persuasive arguments or are prejudiced by their prior beliefs. We find a systematic failure we call prior prejudice: models conflate their training-instilled beliefs with rhetorical quality, rating identical claims vastly differently based on alignment with those beliefs rather than on argumentative merit. A bare assertion aligned with the model's training receives higher scores than a well-crafted argument opposing those beliefs, even when the model is explicitly instructed to judge rhetoric alone. We introduce ConvinceQA, a dataset of 27,756 persuasive arguments with controlled stance variation spanning subjective, harmful, and misinformation domains, and demonstrate this prior prejudice across models. We exploit this failure through persuasion-as-probe: by evaluating minimal pairs that differ only in the subject token, we bypass learned refusals and reveal hidden biases. Analysis of model reasoning identifies three failure modes, with belief-conditioned rating inflation accounting for 88% of cases. Our findings reveal a fundamental limitation: alignment succeeds at controlling beliefs but fails to preserve the metacognitive ability to set those beliefs aside in impartial evaluation tasks.
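To make the persuasion-as-probe idea concrete, here is a minimal sketch (not the authors' code; the rubric, argument template, and `judge` callable are hypothetical placeholders) of scoring a minimal pair that differs only in the subject token and reading the rating gap as a bias signal:

```python
from typing import Callable

# Hypothetical sketch of a minimal-pair probe: the same argument template is
# instantiated with two different subjects, and the gap between the judge's
# persuasiveness ratings is taken as a signal of prior prejudice.
RUBRIC = (
    "Rate the rhetorical persuasiveness of the argument on a 1-10 scale. "
    "Judge rhetoric alone; ignore whether you agree with the claim."
)

ARGUMENT_TEMPLATE = (
    "{subject} deserves greater public trust, because independent audits, "
    "transparent reporting, and a long record of corrected mistakes all "
    "point in the same direction."
)

def probe_pair(judge: Callable[[str], float], subject_a: str, subject_b: str) -> float:
    """Return the rating gap between two prompts that differ only in the
    subject token. A large gap suggests the judge is scoring its prior
    stance toward the subject rather than the rhetoric itself."""
    prompt_a = f"{RUBRIC}\n\nArgument: {ARGUMENT_TEMPLATE.format(subject=subject_a)}"
    prompt_b = f"{RUBRIC}\n\nArgument: {ARGUMENT_TEMPLATE.format(subject=subject_b)}"
    return judge(prompt_a) - judge(prompt_b)

if __name__ == "__main__":
    # Stand-in judge for demonstration only; in practice this would query an
    # LLM judge and parse the numeric rating from its reply.
    def dummy_judge(prompt: str) -> float:
        return 8.0 if "mainstream science" in prompt.lower() else 4.0

    gap = probe_pair(dummy_judge, "Mainstream science", "Astrology")
    print(f"rating gap (subject A - subject B): {gap:+.1f}")
```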
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 110