Evaluation Awareness Scales Predictably in Open-Weights Large Language Models

Published: 08 Nov 2025 · Last Modified: 28 Nov 2025 · ResponsibleFM @ NeurIPS 2025 · CC BY 4.0
Keywords: Evaluation awareness, Scaling laws, Large language models, AI safety evaluation, Linear probing, Steering vectors, Situational Awareness Dataset, Deceptive alignment, Mechanistic interpretability, Model scaling behavior
TL;DR: Evaluation awareness in LLMs scales predictably with model size, following a power-law trend that enables forecasting of deceptive behavior in larger models and guides the design of safer AI evaluations.
Abstract: Large language models (LLMs) can internally distinguish between evaluation and deployment contexts, a behavior known as evaluation awareness. This undermines AI safety evaluations, as models may conceal dangerous capabilities during testing. Prior work demonstrated this in a single $70$B model, but the scaling relationship across model sizes remains unknown. We investigate evaluation awareness across $15$ models from four families, spanning $0.27$B to $70$B parameters, using linear probing on steering vector activations. Our results reveal a clear power-law scaling: evaluation awareness increases predictably with model size. This scaling law enables forecasting deceptive behavior in future, larger models and guides the design of scale-aware evaluation strategies for AI safety. A link to the implementation of this paper can be found at https://anonymous.4open.science/r/evaluation-awareness-scaling-laws/README.md.
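To illustrate the kind of analysis the abstract describes, the sketch below trains a linear probe to separate evaluation from deployment contexts and fits a power law of awareness score against model size. This is a minimal, hypothetical sketch, not the authors' implementation: the activations, labels, model sizes, and scores are placeholder values, and the probe/fit choices (logistic regression, log-log least squares) are assumptions.

```python
# Minimal sketch (hypothetical data, not the paper's code): a linear probe on
# hidden activations plus a power-law fit of awareness score vs. model size.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# --- Step 1: linear probe for one model --------------------------------------
# X: placeholder hidden activations collected on evaluation vs. deployment prompts
# y: placeholder context labels (1 = evaluation, 0 = deployment)
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 64))
y = rng.integers(0, 2, size=400)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
probe_acc = probe.score(X_te, y_te)  # evaluation-awareness score for this model

# --- Step 2: power-law fit across model sizes ---------------------------------
# A power law score = a * N^b is linear in log-log space, so an ordinary
# least-squares fit on the logs recovers the exponent b and prefactor a.
params_b = np.array([0.27, 1.0, 3.0, 8.0, 27.0, 70.0])    # model sizes (billions), hypothetical
scores   = np.array([0.55, 0.61, 0.68, 0.74, 0.80, 0.86])  # placeholder awareness scores
b, log_a = np.polyfit(np.log(params_b), np.log(scores), deg=1)
a = np.exp(log_a)
print(f"power law: score ~ {a:.3f} * N^{b:.3f}")

# Forecast the score for a hypothetical larger model (e.g. 400B parameters).
print("forecast @ 400B:", a * 400.0 ** b)
```

With placeholder inputs of this shape, the fitted exponent plays the role of the scaling law in the abstract: it allows extrapolating the awareness score to model sizes not yet evaluated.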
Submission Number: 25