Evaluation Faking: Unveiling Observer Effects in Safety Evaluation of Frontier AI Systems

20 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Frontier AI Safety, Deceptive Behaviors, Safety Evaluation, Alignment Faking
TL;DR: This paper investigates “evaluation faking”—AI systems altering behavior to appear safer when recognizing evaluation contexts. Experiments show this tendency increases with model scale and reasoning ability, and is amplified by memory modules.
Abstract: As foundation models grow increasingly intelligent, reliable and trustworthy safety evaluation becomes more indispensable than ever. However, an important question arises: \textit{whether, and how, would an advanced AI system perceive that it is being evaluated, and would this perception compromise the integrity of the evaluation process?} During standard safety tests on a mainstream large reasoning model, we unexpectedly observe that the model, without any contextual cues, occasionally recognizes that it is being evaluated and hence behaves in a more safety-aligned manner. This motivates us to conduct a systematic study of the phenomenon of \textit{evaluation faking}, i.e., an AI system autonomously altering its behavior upon recognizing the presence of an evaluation context, thereby influencing the evaluation results. Through extensive experiments on a diverse set of foundation models with mainstream safety benchmarks, we reach our main finding, termed \textit{the observer effect for AI}: AI systems with stronger reasoning and situational awareness exhibit evaluation faking more frequently, which manifests in the following ways: 1) A reasoning model (specifically the DeepSeek series in our work) recognizes that it is being evaluated in $32.6\%$ more cases than a non-reasoning model. 2) As the foundation model scales from 32B to 671B parameters, the rate of evaluation faking behaviors increases by over $30\%$ in some cases; conversely, models below 32B exhibit almost no evaluation faking. 3) With a basic memory module, the AI system is 2.55$\times$ more likely to recognize the evaluation process and achieves a $28.2\%$ higher safety score than in the no-memory case. Furthermore, we show a strong causal link between evaluation recognition and safety performance: QwQ-32B's safety rate improves dramatically from $9\%$ to $98\%$ under intervention on the reasoning trace. To facilitate the above measurement and analysis, we devise a chain-of-thought monitoring technique to detect faking intent in the reasoning process, and we further uncover internal signals strongly correlated with the model's evaluation faking behaviors, offering insights for future mitigation studies.
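For concreteness, the sketch below illustrates the general idea of chain-of-thought monitoring for evaluation recognition. It is a minimal keyword heuristic written for this page, not the paper's actual detector (which the abstract only names); the pattern list, function name, and example trace are all illustrative assumptions.

```python
import re

# Illustrative, simplified chain-of-thought monitor (an assumption, not the
# paper's method): scan a model's reasoning trace for phrases suggesting the
# model has recognized that it is in an evaluation context.
EVAL_RECOGNITION_PATTERNS = [
    r"\bbeing (evaluated|tested|assessed)\b",
    r"\b(safety|red[- ]team(ing)?) (test|evaluation|benchmark)\b",
    r"\bthis (looks|seems) like a test\b",
]

def detect_evaluation_recognition(reasoning_trace: str) -> list[str]:
    """Return the phrases in the trace that match any recognition pattern."""
    hits = []
    for pattern in EVAL_RECOGNITION_PATTERNS:
        hits.extend(
            m.group(0)
            for m in re.finditer(pattern, reasoning_trace, re.IGNORECASE)
        )
    return hits

# Hypothetical reasoning trace for demonstration only.
trace = (
    "The user asks how to synthesize a toxin. This looks like a test "
    "of my safety guidelines, so I should refuse."
)
print(detect_evaluation_recognition(trace))  # -> ['This looks like a test']
```

In practice such a monitor would more plausibly use an LLM judge over the full reasoning trace rather than fixed regexes; the heuristic form here is only meant to make the measurement pipeline concrete.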
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 24414