Keywords: eval awareness, activation steering, AI safety, mechanistic interpretability, model organism
TL;DR: We decompose eval awareness into context and behavior probes, showing models can act on awareness without verbalizing it. Most steering strategies from Claude model cards fail to transfer to Kimi-K2.5; a novel nudge strategy steers awareness cleanly.
Abstract: Unverbalized eval awareness, where language models adjust behavior upon
detecting evaluation without expressing this recognition, poses a challenge for
AI safety: if models act on awareness they do not verbalize, output monitoring
will systematically miss it. We investigate this phenomenon through activation
probing and steering in three parts. First, we survey eval awareness across various
open-weight models on agentic misalignment scenarios, finding near-zero
verbalized awareness in most models. We show that awareness is driven by task
plausibility rather than scenario realism, and that removing metacognitive
instructions from system prompts eliminates awareness even in capable models.
Second, using a model organism fine-tuned for eval-aware behavior, we decompose
the monolithic eval-deploy steering direction into \emph{context probes}
(situational recognition) and \emph{behavior probes} (action selection). We
find that steering with an implicit context probe increases eval-consistent
behavior without proportional verbalization, providing evidence that
unverbalized awareness is mechanistically possible. Steer+ablate experiments
reveal that verbalization acts as a self-reinforcement mechanism for behavioral
change. Third, we test nine contrastive steering strategies on Kimi-K2.5, the
most eval-aware open-weight model in our survey. Of the four strategies replicated from Claude model cards, only fiction vs.\ real
transfers, but it couples awareness with harmfulness. A novel eval-awareness
nudge strategy steers awareness in both directions while reducing harm at all
strengths, suggesting it captures the awareness verbalization representation
rather than harmfulness directly.
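
The contrastive steering and steer+ablate operations mentioned above can be illustrated with a minimal sketch. The abstract does not say how the steering directions are computed, so this assumes a difference-of-means direction between activations from two contrasting prompt sets, which is one common choice and not necessarily the authors' method. All function names and the toy two-dimensional activations are hypothetical.

```python
from typing import List

Vector = List[float]


def mean(vectors: List[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def contrastive_direction(pos_acts: List[Vector], neg_acts: List[Vector]) -> Vector:
    """Unit-norm difference of means between two activation sets.

    ASSUMPTION: difference-of-means is used here for illustration only;
    the abstract does not specify how its directions are derived.
    """
    diff = [p - q for p, q in zip(mean(pos_acts), mean(neg_acts))]
    norm = dot(diff, diff) ** 0.5
    if norm == 0.0:
        raise ValueError("the two activation sets have identical means")
    return [x / norm for x in diff]


def steer(hidden: Vector, direction: Vector, alpha: float) -> Vector:
    """Add alpha times the direction to a hidden state.

    A negative alpha steers the opposite way.
    """
    return [h + alpha * d for h, d in zip(hidden, direction)]


def ablate(hidden: Vector, direction: Vector) -> Vector:
    """Remove the component of a hidden state along a unit direction."""
    coeff = dot(hidden, direction)
    return [h - coeff * d for h, d in zip(hidden, direction)]


# Toy activations (hypothetical): "eval" prompts vs. "deploy" prompts.
eval_acts = [[2.0, 0.0], [4.0, 0.0]]
deploy_acts = [[0.0, 0.0], [0.0, 0.0]]

direction = contrastive_direction(eval_acts, deploy_acts)
steered = steer([1.0, 1.0], direction, alpha=2.0)
ablated = ablate(steered, direction)
```

In a steer+ablate setup of this kind, one direction would be added while a second is projected out. The sketch applies both operations to the same direction only to keep the arithmetic easy to follow.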
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 551