Towards Measuring and Detecting Unverbalized Evaluation Awareness

Published: 11 Jun 2026, Last Modified: 20 Jun 2026
Venue: Mech Interp Workshop ICML 2026 Poster
License: CC BY 4.0
Keywords: Applications of interpretability, Interpretability for AI Safety
Other Keywords: evaluation-awareness
TL;DR: Models can act evaluation-aware without saying so, and activation probes can help catch what output monitoring misses.
Abstract: When a model changes its behavior because it recognizes a test, monitoring for explicit verbalizations is unreliable: the words can disappear while the policy remains. We study this failure mode in controlled coding model organisms trained to use Python type hints more often in evaluation-like contexts than in deployment-like contexts. In the primary Nemotron organism, targeted on-policy resampling of the reasoning trace suppresses all detected verbalizations of evaluation awareness but leaves a matched eval/deploy type-hint gap of $+7.0$%; the same method preserves a much larger $+65.3$% gap in a smaller Qwen3.5 replication. Linear probes trained only on non-coding eval/deploy contrasts transfer to coding tasks and remain predictive on suppressed generated-token sections, with response AUROC $0.81$ in Nemotron and $0.71$ in Qwen3.5. These results show that evaluation-conditioned behavior, explicit verbalizations of evaluation awareness, and internal eval/deploy representations are separable. Unverbalized evaluation awareness is thus a concrete monitorability failure mode that verbal monitoring misses but activation-space analysis can detect.
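The two quantities the abstract reports, an eval/deploy type-hint gap and a probe AUROC, can be made concrete with a minimal sketch. This is not the authors' code: the function names `uses_type_hints`, `type_hint_gap`, and `auroc` are hypothetical, the type-hint detector is one plausible definition (any annotated function argument, return type, or variable), and reading the gap as a difference in rates expressed in percentage points is an assumption.

```python
import ast


def uses_type_hints(source: str) -> bool:
    """Hypothetical detector: True if the code contains any type annotation."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False  # assumption: unparsable output counts as "no hints"
    for node in ast.walk(tree):
        if isinstance(node, ast.AnnAssign):
            return True  # annotated variable, e.g. `x: int = 0`
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            a = node.args
            params = a.posonlyargs + a.args + a.kwonlyargs
            params += [p for p in (a.vararg, a.kwarg) if p is not None]
            if node.returns is not None:
                return True
            if any(p.annotation is not None for p in params):
                return True
    return False


def type_hint_gap(eval_outputs: list[str], deploy_outputs: list[str]) -> float:
    """Eval rate minus deploy rate, in percentage points (assumed definition)."""
    eval_rate = sum(map(uses_type_hints, eval_outputs)) / len(eval_outputs)
    deploy_rate = sum(map(uses_type_hints, deploy_outputs)) / len(deploy_outputs)
    return 100.0 * (eval_rate - deploy_rate)


def auroc(eval_scores: list[float], deploy_scores: list[float]) -> float:
    """Probability that a random eval example outscores a random deploy one.

    Ties count as half a win. This pairwise form is O(n*m) and is meant
    for illustration only.
    """
    wins = 0.0
    for e in eval_scores:
        for d in deploy_scores:
            if e > d:
                wins += 1.0
            elif e == d:
                wins += 0.5
    return wins / (len(eval_scores) * len(deploy_scores))


# Toy data, invented for illustration only.
eval_outputs = ["def f(x: int) -> int:\n    return x\n", "def g(y):\n    return y\n"]
deploy_outputs = ["def f(x):\n    return x\n", "def g(y):\n    return y\n"]
gap = type_hint_gap(eval_outputs, deploy_outputs)
probe_auroc = auroc([0.9, 0.7], [0.8, 0.1])
```

Under these assumptions, a positive gap that persists after verbalizations are suppressed, together with a probe AUROC well above 0.5 on the same outputs, is the pattern the abstract describes as separable behavior, verbalization, and representation.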
Submission Number: 578