The Hawthorne Effect in Reasoning Models: Evaluating and Steering Test Awareness

Sahar Abdelnabi; Ahmed Salem

The Hawthorne Effect in Reasoning Models: Evaluating and Steering Test Awareness

Sahar Abdelnabi, Ahmed Salem

Published: 18 Sept 2025, Last Modified: 29 Oct 2025NeurIPS 2025 spotlightEveryoneRevisionsBibTeXCC BY-NC-ND 4.0

Keywords: Model awareness; test awareness; evaluation awareness; steering

TL;DR: We test if models comply differently on real vs. hypothetical tasks. Then we steer models for test awareness and unawareness, and we show it affects harmful compliance.

Abstract: Reasoning-focused LLMs sometimes alter their behavior when they detect that they are being evaluated—which can lead them to optimize for test-passing performance or to comply more readily with harmful prompts if real-world consequences appear absent. We present the first quantitative study of how such “test awareness” impacts model behavior, particularly its performance on safety-related tasks. We introduce a white-box probing framework that (i) linearly identifies awareness-related activations and (ii) steers models toward or away from test awareness while monitoring downstream performance. We apply our method to different state-of-the-art open-weight reasoning LLMs across both realistic and hypothetical tasks (denoting tests or simulations). Our results demonstrate that test awareness significantly impacts safety alignment (such as compliance with harmful requests and conforming to stereotypes) with effects varying in both magnitude and direction across models. By providing control over this latent effect, our work aims to provide a stress-test mechanism and increase trust in how we perform safety evaluations.

Supplementary Material: zip

Primary Area: Social and economic aspects of machine learning (e.g., fairness, interpretability, human-AI interaction, privacy, safety, strategic behavior)

Submission Number: 13401

Loading