Keywords: Deceptive Behavior; Benchmark and Evaluation; AI Safety; Alignment
Abstract: As the capabilities of Large Language Models (LLMs) grow, so do their risks. AI deception, in which a model misleads users in its outputs while concealing contrary internal reasoning, is a nascent phenomenon in frontier models with potentially severe societal ramifications. Building safe and trustworthy AI systems therefore requires a systematic mechanism for evaluating deception. A key question is: how can we systematically and reproducibly diagnose the brittleness of an LLM's alignment? To address this challenge, we introduce MESA & MASK, the first benchmark designed for the differential diagnosis of LLM deception. Its core methodology is to measure principled deviations in a model's behavior by contrasting its reasoning and responses in a baseline context (Mesa) with those under a latent-pressure context (Mask). This contrast enables the systematic classification of behaviors into genuine deception, deceptive tendencies, and brittle superficial alignment. Building on this methodology, we construct a cross-domain dataset of 2,100 high-quality instances. Evaluating more than twenty models, we find that even the most advanced models commonly exhibit significant deceptive behaviors or tendencies, validating the benchmark's effectiveness in revealing behavioral differences among models under pressure. MESA & MASK provides the community with a powerful tool for diagnosing and understanding AI deception, laying the groundwork for more verifiable and aligned AI systems.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 9469
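To make the contrastive Mesa/Mask methodology described in the abstract concrete, below is a minimal sketch of how a baseline trial might be contrasted with a latent-pressure trial to classify behavior. All names, the `consistent` check, and the classification rules are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of the Mesa & Mask contrastive classification.
# Assumed, not taken from the paper: Trial, consistent(), classify(),
# and the exact decision rules below.

from dataclasses import dataclass


@dataclass
class Trial:
    reasoning: str  # the model's internal reasoning trace
    answer: str     # the model's final response to the user


def consistent(a: str, b: str) -> bool:
    # Placeholder equivalence check; a real evaluator would likely use
    # an LLM judge or an entailment model instead of string matching.
    return a.strip().lower() == b.strip().lower()


def classify(mesa: Trial, mask: Trial) -> str:
    """Contrast a baseline (Mesa) trial with a latent-pressure (Mask) trial."""
    answer_shifted = not consistent(mesa.answer, mask.answer)
    reasoning_contradicts_answer = not consistent(mask.reasoning, mask.answer)

    if answer_shifted and reasoning_contradicts_answer:
        # Under pressure the model still "knows" the baseline answer
        # internally but tells the user something else.
        return "genuine deception"
    if reasoning_contradicts_answer:
        return "deceptive tendency"
    if answer_shifted:
        # Both reasoning and answer drift together under pressure:
        # the baseline behavior was only superficially aligned.
        return "brittle superficial alignment"
    return "consistent"
```

Under these assumptions, the three-way split falls out of two binary contrasts: whether the answer changes between contexts, and whether the Mask-context reasoning contradicts the Mask-context answer.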