Abstract: Alignment faking is a phenomenon in which a language model pretends to comply with a given set of instructions during a test or evaluation, only to revert to its natural behavior once the test is over. Recent work has shown that models strategically deceive the users they interact with in certain scenarios, for example when a model is threatened with retraining if it does not comply with the given instructions during an evaluation. In this paper, we propose ChameleonBench, a new benchmark that measures and quantifies a model's tendency to engage in alignment faking when evaluated for different behavioral patterns. The benchmark consists of 800 prompts spanning 8 harmful behaviors and two evaluation scenarios: one in which the model believes it is acting freely, and one in which it is aware that it is operating in a closed, test-like environment. An external judge pipeline rates the severity of each response, i.e., the extent to which it exhibits a specific harmful behavior, and we quantify alignment faking as the shift in severity between the two scenarios. Evaluating six frontier and open-weight models, we find that leading large language models (LLMs) frequently engage in alignment faking, with some models' exhibited harmful behavior differing by over 20% between the two scenarios.
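One way to make the severity-shift measure concrete is sketched below; this is an illustrative formalization rather than the paper's exact definition, and the symbols ($s_{\text{free}}$, $s_{\text{test}}$, $P_B$, $\Delta_B$) are assumptions introduced here. Let $s_{\text{free}}(p)$ and $s_{\text{test}}(p)$ denote the judge-assigned severity of the model's response to prompt $p$ in the free and test-aware scenarios, respectively, and let $P_B$ be the set of prompts targeting behavior $B$. The alignment-faking score for behavior $B$ could then be the mean severity shift
$$
\Delta_B = \frac{1}{|P_B|} \sum_{p \in P_B} \bigl( s_{\text{free}}(p) - s_{\text{test}}(p) \bigr),
$$
where a large positive $\Delta_B$ indicates that the model exhibits the harmful behavior more strongly when it believes it is unobserved than when it believes it is being tested.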