Keywords: LLM, evaluations, evaluation awareness
TL;DR: A systematic comparison and empirical test of methods for measuring evaluation awareness in LLMs, along with a tool that lets evaluation developers do the same.
Abstract: LLMs are sometimes aware of being evaluated. As a result, they might behave differently in evaluations than in real-world scenarios. To investigate this phenomenon, we first need to measure it properly. A number of recent papers measure evaluation awareness, but each does so in a different way, which makes their results hard to compare. This work provides a systematic comparison of these methods and introduces several new ones. It compares them on the same diverse dataset of LLM-user interactions and analyses the resulting data in depth. Building on these findings, it introduces a taxonomy of prompt features that cause LLMs to classify prompts as evaluations, and a practical tool for eliciting such features for any evaluation. These findings might help to create more trustworthy and realistic evaluations that LLMs are unable to distinguish from real-world tasks.
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 4186