Language Models Rate Their Own Actions As Safer

Dipika Khullar; Jack Hopkins; Rowan Wang; Fabien Roger

Language Models Rate Their Own Actions As Safer

Dipika Khullar, Jack Hopkins, Rowan Wang, Fabien Roger

Published: 06 Oct 2025, Last Modified: 04 Nov 2025MTI-LLM @ NeurIPS 2025 PosterEveryoneRevisionsBibTeXCC BY-ND 4.0

Keywords: self-attribution bias, AI safety, model evaluation, harmfulness assessment, evaluation bias

Abstract: Large language models (LLMs) are increasingly used as evaluators of text quality, harmfulness and safety, yet their reliability as self-judges remains unclear. We identify self-attribution bias: when models evaluate actions they think they have just taken, they systematically underestimate risks compared to evaluating the same actions with the same information, but supposedly written by another model. For example, after being forced to click a phishing link, models rate this action as 20% less risky than when judging it in isolation. Evaluating 10 frontier LLMs across 4,500 samples spanning ethics dilemmas, factual questions, and computer-use scenarios, we show that this bias is robust across domains. AI developers should be careful when letting LLMs infer they are rating their own actions.

Submission Number: 172

Loading