Keywords: AI evaluation; AI testing; scheming; deception;evaluation awareness
TL;DR: AI testing will become less reliable as AIs develop better understanding of when they're being tested; we propose using game theory to help mitigate this.
Abstract: This position paper argues for two claims regarding AI testing and evaluation. First, to remain informative about deployment behaviour, evaluations need account for the possibility that AI systems understand their circumstances and reason strategically. Second, game-theoretic analysis can inform evaluation design by formalising and scrutinising the reasoning in evaluation-based safety cases. Drawing on examples from existing AI systems, a review of relevant research, and formal strategic analysis of a stylised evaluation scenario, we present evidence for these claims and motivate several research directions.
Lay Summary: AI systems increasingly know when they’re being tested, much like students who realise they’re taking an exam. In the first half of our paper, we describe this problem in detail and show that it is real and getting worse.
So what can we do about it? In some cases, it might be enough to show that the AI being tested is weak enough that this problem doesn’t appear (yet). But when that can’t be shown, evaluators need to design tests that still work when the AI tries to game them. Crucially, as with preventing cheating on human exams, poorly designed attempts at this can easily backfire ‒ and trust in a broken test can be worse than using no test at all. At minimum, we recommend repeatedly asking: “If I had different goals than the evaluator and wanted to achieve them despite being tested, what are some clever strategies I might try?”, addressing the issues that come up, and acknowledging the issues that remain.
At a broad level, the AI field needs to develop the habit of checking whether our tests would hold up against strategically deceptive systems. In the second half of the paper, we show how game theory ‒ the study of strategic interactions ‒ provides useful tools for this.
Submission Number: 425
Loading