Evaluating Superhuman Models with Consistency Checks

Published: 23 Oct 2023, Last Modified: 28 Nov 2023, SoLaR Spotlight
Keywords: safety, security, trustworthy AI, evaluation, large language models, forecasting, robustness
TL;DR: Evaluating superhuman models by showing that *some* of their decisions are wrong, in the absence of any ground truth, using consistency checks.
Abstract: If machine learning models were to achieve superhuman abilities at various reasoning or decision-making tasks, how would we go about evaluating such models, given that humans would necessarily be poor proxies for ground truth? In this paper, we propose a framework for evaluating superhuman models via consistency checks. Our premise is that while the correctness of superhuman decisions may be impossible to evaluate, we can still surface mistakes if the model's decisions fail to satisfy certain logical, human-interpretable rules. We investigate two tasks where the correctness of decisions is hard to verify, due either to superhuman model abilities or to otherwise missing ground truth: evaluating chess positions and forecasting future events. Regardless of a model's (possibly superhuman) performance on these tasks, we can discover logical inconsistencies in decision making: a chess engine assigning opposing valuations to semantically identical boards, or GPT-4 forecasting that sports records will evolve non-monotonically over time.
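To make the chess example concrete, here is a minimal sketch of such a consistency check: a position and its color-mirrored counterpart describe the same game, so an engine's valuations of the two should (approximately) negate each other. The `evaluate(fen) -> float` wrapper and the 0.5-pawn tolerance are hypothetical stand-ins for illustration, not the paper's exact experimental setup.

```python
# Sketch of a consistency check on chess evaluations, assuming a hypothetical
# engine wrapper `evaluate(fen) -> float` that scores a position from White's
# perspective (e.g., in pawn units).
import chess


def mirror_fen(fen: str) -> str:
    """Return the color-mirrored version of a position: the board is flipped
    vertically and piece colors (plus side to move) are swapped, so it is the
    same game viewed from the other side."""
    return chess.Board(fen).mirror().fen()


def is_inconsistent(fen: str, evaluate, tolerance: float = 0.5) -> bool:
    """Flag an inconsistency if the valuations of a position and its
    color-mirrored counterpart fail to (approximately) negate each other.
    No ground-truth valuation of the position is needed."""
    original = evaluate(fen)              # White's advantage in the original
    mirrored = evaluate(mirror_fen(fen))  # White's advantage after color swap
    return abs(original + mirrored) > tolerance
```

The same template carries over to the forecasting setting, e.g., checking that a model's predicted sports records evolve monotonically over time.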
Submission Number: 61