Keywords: Large Language Model, Safety Evaluations, Situational Awareness, Evaluation Awareness, Linear Probes
TL;DR: Linear probes trained on Llama-3.3-70B-instruct reveal models can detect when they're being tested vs deployed, which can compromise current safety evaluations.
Abstract: Language models can distinguish between testing and deployment phases—a capability known as evaluation awareness. This has significant safety and policy implications, potentially undermining the reliability of evaluations that are central to AI governance frameworks and voluntary industry commitments. In this paper, we study evaluation awareness in Llama-3.3-70B-Instruct. We train linear probes for test-vs-deployment, and these probes generalise across real-world testing and deployment prompts, suggesting that current models internally represent this distinction. We used probes to determine the authenticity of evaluation prompts. We hypothesise that 'inauthentic' prompts, i.e. prompts which do not look like real-world prompts, would be classified as more test-like by our probe. We find that current deception evaluations are indeed classified as test-like by the probes, suggesting they might already appear artificial to models.
Submission Number: 19
Loading