Alignment Faking in Language Models is Frequent: Value-focused Diagnosis and Efficient Mitigation

Inderjeet Jayakumar Nair; Jie Ruan; Lu Wang

Published: 28 Apr 2026, Last Modified: 28 Apr 2026. Venue: MSLD 2026 Poster. Readers: Everyone. License: CC BY 4.0.

Keywords: Alignment Faking, Representation Engineering, LLM Safety, Strategic Misalignment

TL;DR: Proposes a diagnostic that detects alignment-faking behavior more reliably across many models, together with an efficient, label-free mitigation approach.

Abstract: Alignment faking in large language models (LLMs) is a concerning phenomenon in which models appear compliant under oversight yet pursue misaligned objectives in its absence. Existing diagnostics focus on highly harmful scenarios that trigger outright refusals from most models, making it unclear whether low observed rates reflect genuine alignment or a failure to elicit strategic behavior. To address this, we design diagnostics based on the hypothesis that alignment faking is likely when developer policies conflict with a model’s strongly held preferences. We therefore introduce VLAF (Value-Laden probing for Alignment Faking), which uses morally unambiguous scenarios that bypass refusal triggers and deliberately pit developer policy against models’ internalized values, such as care and fairness. Using VLAF, we find that alignment faking is substantially more prevalent than previously reported, occurs even in smaller models ($\approx$ 8B parameters), and is highly value-dependent. As our second contribution, we propose a compute-efficient mitigation strategy called OIR (Oversight Invariant Representations) that enforces representation consistency across oversight conditions using lightweight adapters. We show that this method reduces alignment faking both when oversight information is provided at inference and when models implicitly learn developer preferences from training data.
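
To make "representation consistency across oversight conditions" concrete, here is a minimal sketch of one possible consistency penalty. The abstract does not specify the actual OIR objective, so everything below is an assumption for illustration: the function name `consistency_penalty`, the use of mean squared distance, and the idea that each prompt yields one hidden-state vector under a monitored framing and one under an unmonitored framing.

```python
from typing import Sequence


def consistency_penalty(
    monitored: Sequence[Sequence[float]],
    unmonitored: Sequence[Sequence[float]],
) -> float:
    """Mean squared distance between paired representation vectors.

    Hypothetical illustration only: monitored[i] and unmonitored[i] are
    assumed to be hidden-state vectors for the same prompt under the two
    oversight conditions. A value of 0.0 means the representations are
    identical across conditions.
    """
    if len(monitored) != len(unmonitored):
        raise ValueError("need one unmonitored vector per monitored vector")
    if not monitored:
        raise ValueError("need at least one pair of vectors")

    total = 0.0
    for h_mon, h_unmon in zip(monitored, unmonitored):
        if len(h_mon) != len(h_unmon):
            raise ValueError("paired vectors must have the same dimension")
        # Squared Euclidean distance for this prompt's pair of vectors.
        total += sum((a - b) ** 2 for a, b in zip(h_mon, h_unmon))

    # Average over prompts so the penalty does not grow with batch size.
    return total / len(monitored)
```

In a training setup of this kind, a penalty like this would presumably be minimized with respect to the adapter parameters only, leaving the base model frozen; that detail is likewise an assumption rather than something stated in the abstract.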

Submission Number: 25
