Is Evaluation Awareness Just Format Sensitivity? Limitations of Probe-Based Evidence under Controlled Prompt Structure
Keywords: evaluation awareness, negative results, prompt format, probe-based methods, LLM interpretability, AI safety
TL;DR: We show that probe-based evidence for evaluation awareness largely collapses under controlled prompt format, revealing a key methodological limitation.
Abstract: Prior work suggests that large language models encode “evaluation awareness”, a claim often supported by probe-based evidence on benchmark data. However, evaluation benchmarks tightly correlate usage context with prompt structure and genre. We test whether probe-based signals attributed to evaluation awareness persist once prompt format and genre are partially controlled. Using a controlled 2×2 dataset matrix and diagnostic rewrites, we find that linear probes overwhelmingly respond to benchmark-canonical structured formats and fail to generalize to free-form prompts regardless of linguistic style. These results suggest that commonly used probe-based methodologies are insufficient to disentangle evaluation context from structural artifacts, highlighting a key limitation in current evidence for evaluation awareness.
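To make the confound concrete, the following is a minimal, hypothetical sketch of the kind of linear probe the abstract critiques. It uses synthetic vectors in place of real model activations and assumes only that prompt format shifts the representation along some direction; all names and numbers are illustrative, not the authors' actual setup.

```python
# Hypothetical sketch: a linear probe trained on format-confounded data.
# Synthetic "activations" stand in for model hidden states; the format
# confound is a single direction added to structured-prompt examples.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 64                        # dimensionality of the fake activations
fmt_dir = rng.normal(size=d)  # direction encoding *format*, not eval context

def activations(structured: bool, n: int = 200) -> np.ndarray:
    """Fake hidden states; structured prompts are shifted along fmt_dir."""
    base = rng.normal(size=(n, d))
    return base + (2.0 * fmt_dir if structured else 0.0)

# Train a probe to separate "evaluation" (structured) from "deployment"
# (free-form) prompts -- the confounded setup a 2x2 matrix controls for.
X_train = np.vstack([activations(True), activations(False)])
y_train = np.array([1] * 200 + [0] * 200)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# The probe scores highly on a held-out split that shares the format
# confound, even though it has only learned the format direction.
X_test = np.vstack([activations(True), activations(False)])
acc = probe.score(X_test, np.array([1] * 200 + [0] * 200))
print(f"probe accuracy on format-confounded split: {acc:.2f}")
```

In this toy setting the probe looks like strong evidence for evaluation awareness, yet by construction it has only picked up the format direction; crossing format with genre, as in the paper's 2×2 design, is what separates the two explanations.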
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 12