Is Evaluation Awareness Just Format Sensitivity? Limitations of Probe-Based Evidence under Controlled Prompt Structure
Keywords: evaluation awareness, negative results, prompt format, probe-based methods, LLM interpretability, AI safety
TL;DR: We show that probe-based evidence for evaluation awareness largely collapses under controlled prompt format, revealing a key methodological limitation.
Abstract: Prior work suggests that large language models encode “evaluation awareness”, a claim often supported by probe-based evidence on benchmark data. However, evaluation benchmarks tightly correlate usage context with prompt structure and genre. We test whether probe-based signals attributed to evaluation awareness persist once prompt format and genre are partially controlled. Using a controlled 2×2 dataset matrix and diagnostic rewrites, we find that linear probes overwhelmingly respond to benchmark-canonical structured formats and fail to generalize to free-form prompts regardless of linguistic style. These results suggest that commonly used probe-based methodologies are insufficient to disentangle evaluation context from structural artifacts, highlighting a key limitation in current evidence for evaluation awareness.
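To make the confound concrete, the following is a minimal, hypothetical sketch of the kind of linear probe the abstract critiques. It uses synthetic vectors in place of real model activations and assumes only that prompt format shifts the representation along some direction; all names and numbers are illustrative, not the authors' actual setup.

```python
# Hypothetical sketch: a linear probe trained on format-confounded data.
# Synthetic "activations" stand in for model hidden states; the format
# confound is a single direction added to structured-prompt examples.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 64                        # dimensionality of the fake activations
fmt_dir = rng.normal(size=d)  # direction encoding *format*, not eval context

def activations(structured: bool, n: int = 200) -> np.ndarray:
    """Fake hidden states; structured prompts are shifted along fmt_dir."""
    base = rng.normal(size=(n, d))
    return base + (2.0 * fmt_dir if structured else 0.0)

# Train a probe to separate "evaluation" (structured) from "deployment"
# (free-form) prompts -- the confounded setup a 2x2 matrix controls for.
X_train = np.vstack([activations(True), activations(False)])
y_train = np.array([1] * 200 + [0] * 200)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# The probe scores highly on a held-out split that shares the format
# confound, even though it has only learned the format direction.
X_test = np.vstack([activations(True), activations(False)])
acc = probe.score(X_test, np.array([1] * 200 + [0] * 200))
print(f"probe accuracy on format-confounded split: {acc:.2f}")
```

In this toy setting the probe looks like strong evidence for evaluation awareness, yet by construction it has only picked up the format direction; crossing format with genre, as in the paper's 2×2 design, is what separates the two explanations.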
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 12