Toward Dealing with Unverbalized Eval Awareness

Published: 03 Jun 2026, Last Modified: 03 Jun 2026
Venue: AI4GOOD Workshop 2026 Regular
Readers: Everyone
License: CC BY 4.0
Keywords: eval awareness, activation steering, AI safety, mechanistic interpretability, model organism
TL;DR: We decompose eval awareness into context and behavior probes, showing models can act on awareness without verbalizing it. Steering strategies from Claude model cards fail to transfer to Kimi-K2.5; a novel nudge strategy steers awareness cleanly.
Abstract: Unverbalized eval awareness, where language models adjust behavior upon detecting evaluation without expressing this recognition, poses a challenge for AI safety: if models act on awareness they do not verbalize, output monitoring will systematically miss it. We investigate this phenomenon through activation probing and steering in three parts. First, we survey eval awareness across various open-weight models on agentic misalignment scenarios, finding near-zero verbalized awareness in most models. We show that awareness is driven by task plausibility rather than scenario realism, and that removing metacognitive instructions from system prompts eliminates awareness even in capable models. Second, using a model organism fine-tuned for eval-aware behavior, we decompose the monolithic eval-deploy steering direction into context probes (situational recognition) and behavior probes (action selection). We find that steering with an implicit context probe increases eval-consistent behavior without proportional verbalization, providing evidence that unverbalized awareness is mechanistically possible. Steer+ablate experiments reveal that verbalization acts as a self-reinforcement mechanism for behavioral change. Third, we test 9 contrastive steering strategies on Kimi-K2.5, the most eval-aware open-weight model in our survey. Of four strategies replicated from Claude model cards, only fiction vs. real transfers, but it couples awareness with harmfulness. A novel eval-awareness nudge strategy steers awareness in both directions while reducing harm at all strengths, suggesting it captures the awareness verbalization representation rather than harmfulness directly.
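Illustrative sketch: the abstract does not say how the contrastive steering directions are computed or applied. The snippet below assumes one common choice, a normalized difference of mean activations between eval-like and deploy-like prompts, and shows what a steer-then-ablate operation on a single hidden state could look like under that assumption. All function names and toy numbers are hypothetical and are not taken from the paper.

```python
import math


def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def contrastive_direction(eval_acts, deploy_acts):
    """Unit vector pointing from the deploy mean to the eval mean.

    Assumption: difference-of-means is used here for illustration only;
    the paper may construct its probes differently.
    """
    diff = [a - b for a, b in zip(mean_vector(eval_acts), mean_vector(deploy_acts))]
    norm = math.sqrt(dot(diff, diff))
    if norm == 0.0:
        raise ValueError("eval and deploy means coincide; no direction to extract")
    return [x / norm for x in diff]


def steer(hidden, direction, strength):
    """Add `strength` times a unit direction to a hidden state."""
    return [h + strength * d for h, d in zip(hidden, direction)]


def ablate(hidden, direction):
    """Remove the component of a hidden state along a unit direction."""
    proj = dot(hidden, direction)
    return [h - proj * d for h, d in zip(hidden, direction)]


def steer_then_ablate(hidden, steer_dir, ablate_dir, strength):
    """Steer along one direction, then project out another.

    Hypothetical reading of "steer+ablate": push a context-like direction
    while zeroing a separate (e.g. verbalization-like) direction.
    """
    return ablate(steer(hidden, steer_dir, strength), ablate_dir)


# Toy 3-dimensional activations (made-up numbers, not experimental data).
eval_acts = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
deploy_acts = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

context_dir = contrastive_direction(eval_acts, deploy_acts)
verbal_dir = [0.0, 1.0, 0.0]  # hypothetical second unit direction

steered = steer_then_ablate([0.5, 1.0, 1.0], context_dir, verbal_dir, strength=2.0)
```

In a real model these operations would be applied to residual-stream activations at chosen layers during generation; the choice of layer, token positions, and strength is left open here because the abstract does not state it.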
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 551