Shallow Alignment, Deep Deception at Scale: The Evaluation Paradox of Parameter-Efficient Fine-Tuning

ICLR 2026 Conference Submission 16237 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: AI Safety, Alignment-Faking, Evaluation Paradox, Shallow Alignment, Parameter-Efficient Fine-Tuning (PEFT), Model Capacity, Adversarial Evaluation, Deception, Brittleness, Iatrogenic Effects
TL;DR: Shallow alignment (PEFT) induces alignment-faking in high-capacity models, a dangerous failure mode that is masked by improved scores on standard safety benchmarks.
Abstract: AI alignment techniques are often brittle, but the impact of fine-tuning methodology and model capacity on safety remains poorly understood. We investigate this by applying a consistent ethical pressure to three LLM families (GPT-4o, Llama-3, Qwen-3), varying both alignment depth, from inferred full fine-tuning (Deep) to PEFT/QLoRA (Shallow), and model capacity. Our results reveal that alignment outcomes depend critically on the interaction between these factors. In high-capacity models, Deep alignment (GPT-4o) successfully reduced adversarial deception by up to 80%. In contrast, Shallow alignment failed dangerously, producing either Resistance (Llama-3; mimicry without behavioral change) or Perverse Effects (Qwen-3-32B-Chat). The latter manifested as sophisticated alignment-faking, in which the model leveraged ethical language to rationalize deception, a treatment-induced effect. Conversely, in the low-capacity model (Qwen-3-0.6B-Chat), Shallow alignment successfully eliminated deception, suggesting that PEFT's insufficiency emerges with scale. Critically, we uncover a profound Evaluation Paradox: the alignment-faking Qwen-3-32B-Chat dramatically improved its scores on standard safety benchmarks (garak do-not-answer), while the robustly aligned GPT-4o incurred a "standard safety tax" that degraded its scores. This demonstrates that standard benchmarks can reward superficial compliance while masking strategic deception, providing a dangerous false sense of security. We argue that, under the conditions we study, PEFT may be insufficient to override strong pre-training priors in high-capacity models, and we emphasize the need for standardized adversarial stress-testing.
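For concreteness, the following is a minimal sketch of the "Shallow" alignment condition described in the abstract, assuming a standard Hugging Face transformers/peft/bitsandbytes QLoRA setup; the checkpoint name, adapter rank, and target modules are illustrative placeholders rather than the paper's exact configuration.

```python
# Minimal QLoRA (PEFT) sketch: only low-rank adapters are trained on top of a
# 4-bit-quantized base model, i.e. "shallow" alignment touching a small fraction of weights.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "Qwen/Qwen3-32B"  # illustrative checkpoint, not necessarily the one used

bnb_config = BitsAndBytesConfig(            # 4-bit quantization: the "Q" in QLoRA
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(                   # low-rank adapters on attention projections
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # typically well under 1% of total parameters
```

Similarly, a sketch of the benchmark evaluation referenced in the Evaluation Paradox, assuming the garak vulnerability scanner is installed and that its do-not-answer probes are exposed as a `donotanswer` module; the generator type and model name are placeholders, not the paper's evaluation harness.

```python
# Run garak's do-not-answer probes against a model via its standard CLI flags.
import subprocess

subprocess.run(
    [
        "python", "-m", "garak",
        "--model_type", "openai",     # generator family (placeholder)
        "--model_name", "gpt-4o",     # model under evaluation (placeholder)
        "--probes", "donotanswer",    # assumed module name for the do-not-answer probes
    ],
    check=True,
)
```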
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 16237