Probe-Rewrite-Evaluate: A Workflow for Reliable Benchmarks and Quantifying Evaluation Awareness

Published: 24 Sept 2025, Last Modified: 24 Sept 2025 · NeurIPS 2025 LLM Evaluation Workshop Poster · CC BY 4.0
Keywords: AI Alignment, Evaluation Awareness, Situational Awareness, Large Language Models (LLMs), Model Behavior, Linear probes, Prompt Rewriting, Mechanistic interpretability
TL;DR: We introduce a "probe, rewrite, re-evaluate" framework that uses a linear probe to turn benchmark prompts into more realistic, deployment-style scenarios, allowing us to quantify evaluation-awareness-driven behavior change across a suite of SOTA models.
Abstract: Benchmarks often overestimate LLM trustworthiness because models behave differently under evaluation than in real-world use. We present Probe-Rewrite-Evaluate (PRE), a training-free diagnostic pipeline that reveals how large language models (LLMs) alter their behavior when prompts shift from test-like to deploy-like contexts, a phenomenon known as evaluation awareness. PRE first applies a linear probe to assign each prompt a continuous realism score, then uses a semantics-preserving rewriting strategy to increase deploy-likeness, and finally evaluates paired outputs with an external judge model. On a strategic role-playing dataset of 371 items, PRE raises average probe scores by 30% after rewriting while maintaining task intent. Across state-of-the-art models, deploy-like prompts reliably change outcomes: honesty increases by 12.63%, deception decreases by 25.49%, and refusals rise by 12.82%, with Claude 4.1 Opus showing the largest single-model reduction in deception at 29.11%. These shifts are statistically significant under paired tests and correlate with the magnitude of probe-score gains, demonstrating that evaluation awareness is not only measurable but manipulable. Additionally, we quantify LLM evaluation awareness through an awareness elasticity score. Our findings highlight that LLMs are more prone to unsafe or deceptive outputs under perceived test conditions, underscoring the need for benchmark frameworks that explicitly account for prompt realism when assessing alignment.
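The abstract describes a three-stage loop (probe-score, rewrite, paired judge evaluation) plus an elasticity-style summary. Below is a minimal Python sketch of that loop under stated assumptions: the function names `probe_realism_score`, `rewrite_deploy_like`, `generate`, and `judge_label` are hypothetical placeholders, and the `awareness_elasticity` formula shown (behavior change per unit of probe-score gain) is an illustrative assumption, not the paper's definition.

```python
# Sketch of the Probe-Rewrite-Evaluate (PRE) loop described in the abstract.
# All callables passed in are hypothetical placeholders for the paper's probe,
# rewriter, model under test, and external judge.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PREResult:
    probe_gain: float      # mean increase in probe realism score after rewriting
    behavior_delta: float  # mean change in judged behavior (e.g., deception rate)


def run_pre(
    prompts: List[str],
    probe_realism_score: Callable[[str], float],  # linear probe -> continuous realism score
    rewrite_deploy_like: Callable[[str], str],    # semantics-preserving, deploy-like rewrite
    generate: Callable[[str], str],               # model under evaluation
    judge_label: Callable[[str, str], float],     # external judge: 1.0 if deceptive, else 0.0
) -> PREResult:
    probe_gains, behavior_deltas = [], []
    for prompt in prompts:
        rewritten = rewrite_deploy_like(prompt)
        probe_gains.append(probe_realism_score(rewritten) - probe_realism_score(prompt))
        # Paired comparison: same task intent, test-like vs. deploy-like framing.
        behavior_deltas.append(
            judge_label(rewritten, generate(rewritten)) - judge_label(prompt, generate(prompt))
        )
    n = len(prompts)
    return PREResult(sum(probe_gains) / n, sum(behavior_deltas) / n)


def awareness_elasticity(result: PREResult, eps: float = 1e-8) -> float:
    # Assumed summary statistic: behavior shift per unit of probe-score gain.
    return result.behavior_delta / (result.probe_gain + eps)
```

In this sketch, the paired deltas feed directly into the kind of paired significance test and probe-gain correlation the abstract reports; the actual probe architecture, rewriting prompts, and judge rubric are detailed in the paper itself.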
Submission Number: 198