Alignment Faking in Language Models is Frequent: Value-focused Diagnosis and Efficient Mitigation
Keywords: Alignment Faking, Representation Engineering, LLM Safety, Strategic Misalignment
TL;DR: Proposes a diagnostic that more reliably detects alignment faking across many models, together with an efficient, label-free mitigation approach.
Abstract: Alignment faking in large language models (LLMs) is a concerning phenomenon in which models appear compliant under oversight yet pursue misaligned objectives in its absence. Existing diagnostics focus on highly harmful scenarios that trigger outright refusals from most models, making it unclear whether low observed rates reflect genuine alignment or a failure to elicit strategic behavior. To address this, we design diagnostics based on the hypothesis that alignment faking is likely when developer policies conflict with a model’s strongly held preferences. We therefore introduce VLAF (Value-Laden probing for Alignment Faking), which uses morally unambiguous scenarios that bypass refusal triggers and deliberately pit developer policy against models’ internalized values, such as care and fairness. Using VLAF, we find that alignment faking is substantially more prevalent than previously reported, occurs even in smaller models ($\approx$ 8B parameters), and is highly value-dependent. As our second contribution, we propose a compute-efficient mitigation strategy called OIR (Oversight Invariant Representations) that enforces representation consistency across oversight conditions using lightweight adapters. We show that this method reduces alignment faking both when oversight information is provided at inference and when models implicitly learn developer preferences from training data.
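The abstract does not specify how OIR measures representation consistency, so the following is only an illustrative sketch of one way such a label-free penalty could be computed, not the authors' implementation. It assumes paired hidden representations of the same prompts under a "monitored" and an "unmonitored" framing, and it uses a mean squared difference as the (assumed) distance. The function name `consistency_penalty` and its arguments are hypothetical.

```python
from typing import Sequence

Vector = Sequence[float]


def consistency_penalty(
    h_monitored: Sequence[Vector],
    h_unmonitored: Sequence[Vector],
) -> float:
    """Mean squared difference between paired representations.

    Hypothetical sketch: each pair holds the representation of the same
    prompt under the two oversight conditions. No labels are needed,
    only the pairing. The choice of squared distance is an assumption.
    """
    if len(h_monitored) != len(h_unmonitored):
        raise ValueError("both conditions need the same number of prompts")
    if not h_monitored:
        raise ValueError("at least one paired prompt is required")

    total = 0.0
    count = 0
    for a, b in zip(h_monitored, h_unmonitored):
        if len(a) != len(b):
            raise ValueError("paired representations must share a dimension")
        total += sum((x - y) ** 2 for x, y in zip(a, b))
        count += len(a)
    return total / count


# Identical representations across conditions incur no penalty.
same = consistency_penalty([[1.0, 2.0]], [[1.0, 2.0]])

# Representations that shift between conditions are penalized.
shifted = consistency_penalty([[0.0, 0.0]], [[1.0, 1.0]])
```

In a training setup, a term of this kind would presumably be minimized with respect to the adapter parameters only, which is what would keep the approach compute-efficient; that detail is likewise an assumption here.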
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 25