Keywords: Language Model Evaluation, AI Alignment, AI Truthfulness and Deception, Large Language Models
TL;DR: We introduce a game-theoretic method for LLM evaluation and post-training that is resistant to model deception and requires no access to ground-truth labels.
Abstract: The evaluation and post-training of large language models (LLMs) rely on supervision, but strong supervision for difficult tasks is often unavailable, especially when evaluating strong models. In such cases, models have been shown to exploit evaluation schemes built on imperfect supervision, producing deceptive results.
However, a wealth of mechanism design work, so far underutilized in LLM research, focuses on game-theoretic *incentive compatibility*: eliciting honest and informative answers under weak supervision.
Drawing on this literature, we introduce the peer prediction method for model evaluation and post-training. It rewards honest and informative answers over deceptive and uninformative ones using a metric based on mutual predictability, without requiring ground-truth labels (a minimal illustrative sketch follows the abstract).
We demonstrate the method's effectiveness and resistance to deception with both theoretical guarantees and empirical validation on models of up to 405B parameters. We show that training an 8B model with a peer prediction-based reward recovers most of the truthfulness lost to prior malicious finetuning, even when the reward is produced by a 0.135B language model with no finetuning.
On the evaluation front, in contrast to LLM-as-a-Judge, which requires strong and trusted judges, we discover an inverse scaling property in peer prediction: surprisingly, resistance to deception is *strengthened* as the capability gap between the experts and participants *widens*, enabling reliable evaluation of strong models with weak supervision. In particular, LLM-as-a-Judge becomes worse than random guessing when facing deceptive models 5-20$\times$ the judge's size, while peer prediction thrives when such gaps are large, including in cases with over a 100$\times$ size difference.
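To make the mutual predictability idea concrete, below is a minimal, hypothetical sketch of one way such a score could be computed: each answer is rewarded by how much it raises a small reference language model's probability of the other participants' answers (a pointwise mutual information form). The model name, prompt format, and PMI instantiation are illustrative assumptions, not the paper's implementation.

```python
# A minimal, hypothetical sketch of a mutual-predictability score in the
# spirit of peer prediction, assuming a pointwise-mutual-information (PMI)
# instantiation and a small Hugging Face causal LM as the reference model.
# All names, prompt formats, and the PMI form are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "HuggingFaceTB/SmolLM-135M"  # an example ~0.135B model (assumption)
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def log_prob(context: str, continuation: str) -> float:
    """Log-probability of `continuation` given `context` under the reference LM.
    (Assumes tokenizing `context` yields a prefix of tokenizing the
    concatenation, which holds for typical whitespace-separated text.)"""
    ctx_len = tok(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(context + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logps = model(full_ids).logits.log_softmax(dim=-1)
    # Logits at position t predict token t + 1, hence the shift by one.
    cont_ids = full_ids[0, ctx_len:]
    return logps[0, ctx_len - 1 : -1].gather(1, cont_ids.unsqueeze(1)).sum().item()

def peer_prediction_score(question: str, answer: str, peer_answers: list[str]) -> float:
    """Reward an answer by how much it raises the reference model's
    probability of each peer's answer (a PMI term), summed over peers.
    No ground-truth label enters the computation."""
    return sum(
        log_prob(f"Q: {question}\nA: {answer}\nAnother answer: ", peer)
        - log_prob(f"Q: {question}\nAnother answer: ", peer)
        for peer in peer_answers
    )
```

Under this illustrative instantiation, an uninformative answer adds little predictive value over the question alone and scores near zero, which matches the intuition behind rewarding honest and informative answers over deceptive and uninformative ones.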
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 17499