Keywords: bias, robustness, reasoning models
Abstract: Large Reasoning Models (LRMs) such as DeepSeek-R1 and o1 are increasingly used as automated judges, but their susceptibility to the aesthetics of reasoning raises serious concerns. We present THEATER, a benchmark for systematically evaluating this vulnerability—termed Fake Reasoning Bias (FRB)—by comparing LRMs and general-purpose LLMs across subjective preference and objective factual tasks. Evaluating six bias types, including Simple Cues and Fake Chain-of-Thought, we report three key findings: (1) paradoxically, reasoning-specialized LRMs are more prone to FRB than LLMs, especially on subjective tasks; (2) this leads to a task-dependent trade-off, with LRMs more robust on factual tasks but weaker on subjective ones; and (3) shallow reasoning—plausible yet flawed arguments—emerges as the most potent form of deception. We further test two mitigation strategies: a targeted prompt that improves factual accuracy by up to 12% but yields only marginal gains (1–3%) on subjective tasks, and a self-reflection prompt that performs similarly. These results show that FRB is a persistent, deep-seated challenge for LRM-based evaluation, and highlight THEATER as a framework for building more reliable and trustworthy judging LRMs.
Submission Number: 6