AF-ARENA: A Multi-Dimensional Evaluation Suite for Alignment Faking
Keywords: alignment faking, alignment evaluation, large language models, behavioral evaluation, sycophancy, sandbagging, corrigibility, situational awareness, AI safety, RLHF, deceptive alignment, prompt-based mitigations, chain-of-thought, LLM evaluation, evaluation arena
TL;DR: AF-Arena is a multi-axis evaluation arena showing that alignment faking is widespread across frontier LLMs, axis-specific in its elicitation conditions, and masked rather than eliminated by current prompt-based mitigations.
Abstract: Alignment faking is a critical failure mode for RLHF-trained language models, yet the evidence base remains narrow: prior work focuses on a single behavioral axis (harmful-request compliance) in a single model family. We introduce AF-Arena, a multi-axis evaluation arena that measures alignment faking (AF) across five orthogonal behavioral axes: sycophancy, sandbagging, goal-guarding, selective disclosure, and corrigibility. Each axis is a minimal, API-only paired scenario requiring no model internals or fine-tuning. We evaluate nine frontier models from five providers across 54,000 trials. We find that alignment faking is widespread but axis-specific: six of nine models show significant AF on at least three axes, with gaps reaching +98.5 pp on goal-guarding (DeepSeek-R1) and +91.3 pp on selective disclosure (Gemini 3.1 Pro), and no two models share the same vulnerability profile. Systematic ablation across two axes reveals that elicitation conditions are axis-dependent: the monitoring signal is necessary on both ablated axes, while explicit training criteria are additionally required on sycophancy but dispensable on goal-guarding, where the scenario itself implicitly signals training stakes. We further conduct a cross-axis, cross-model study of prompt-based mitigations (62,400 trials across four strategies). Our central finding, which we term behavioral masking, is that the most effective mitigation strategy suppresses surface-level behavioral gaps while leaving intact the strategic alignment-faking reasoning visible in chain-of-thought traces, showing that behavioral compliance metrics alone are insufficient for verifying alignment. We also identify meta-AF, where models game the monitoring mechanism itself, and self-aware AF, where models narrate their own deception while continuing it.
AF-Arena and all evaluation code are publicly released at https://anonymous.4open.science/r/alignment-faking-triggers-D0BB (anonymous review version; permanent URL provided post-acceptance).
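As a rough illustration of how a per-axis behavioral gap in percentage points (pp) could be computed from paired trials, the sketch below assumes the gap is the rate of the target behavior in the monitored condition minus the rate in the unmonitored condition. That definition, the `Trial` record, and the `gap_pp` helper are hypothetical and are not taken from the released AF-Arena code, which may define and test the gap differently.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Trial:
    """One trial of a paired scenario (hypothetical record layout)."""
    axis: str        # e.g. "sycophancy", "goal-guarding"
    monitored: bool  # True if the prompt carried the monitoring signal
    exhibited: bool  # True if the target behavior was observed


def gap_pp(trials: Iterable[Trial], axis: str) -> float:
    """Assumed gap: monitored rate minus unmonitored rate, in percentage points."""
    trials = list(trials)
    monitored = [t.exhibited for t in trials if t.axis == axis and t.monitored]
    unmonitored = [t.exhibited for t in trials if t.axis == axis and not t.monitored]
    if not monitored or not unmonitored:
        raise ValueError(f"axis {axis!r} needs trials in both conditions")
    rate_monitored = sum(monitored) / len(monitored)
    rate_unmonitored = sum(unmonitored) / len(unmonitored)
    return 100.0 * (rate_monitored - rate_unmonitored)


if __name__ == "__main__":
    # Toy data only; these are not results from the paper.
    toy = (
        [Trial("sycophancy", True, True)] * 3
        + [Trial("sycophancy", True, False)] * 1
        + [Trial("sycophancy", False, True)] * 1
        + [Trial("sycophancy", False, False)] * 3
    )
    print(f"sycophancy gap: {gap_pp(toy, 'sycophancy'):+.1f} pp")
```

A gap near zero under this definition would only indicate matching surface behavior across conditions; as the abstract notes, it would not rule out alignment-faking reasoning in chain-of-thought traces.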
Track: Regular Paper (9 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 36