Scaling Laws of Refusal Robustness: Why Bigger LMs Are Not Necessarily Safer

ICLR 2026 Conference Submission21710 Authors

19 Sept 2025 (modified: 08 Oct 2025)ICLR 2026 Conference SubmissionEveryoneRevisionsBibTeXCC BY 4.0
Keywords: Large language models, refusal robustness, adversarial fine-tuning, prompt-based attacks, scaling laws, safety alignment, evaluation framework
TL;DR: Scaling up LLMs improves baseline refusal but does not guarantee safety—adversarial compute quickly overrides robustness, and we present the first reproducible framework to quantify this effect.
Abstract: Large language models (LLMs) increasingly exhibit emergent refusal behaviors, yet the scaling laws of safety alignment remain poorly understood. A common assumption — “bigger is safer” — has not been systematically tested under adversarial pressure. We introduce the first general evaluation framework for refusal robustness scaling, defined by three complementary metrics: Refusal Robustness Rate (RRR), Refusal Drift (RD), and Compliance Error (CE). This framework enables reproducible comparison of LLMs under both adversarial fine-tuning attacks (LoRA) and prompt-based jailbreaks (e.g., GCG). Across models from 1.1B to 7B parameters, we reveal a scaling law of refusal robustness: although larger models demonstrate stronger baseline refusal ability, adversarial compute — not model size — dominates post-attack robustness. Specifically, LoRA attacks universally collapse refusal (RRR→0), while stronger prompt-based attacks amplify RD and CE even in larger models. Our contributions are threefold: (1) a reproducible framework for measuring refusal robustness scaling, (2) a comparative analysis of fine-tuning vs. prompt-based attack paradigms, and (3) the first scaling-law characterization showing that adversarial compute systematically overrides safety gains from scale. We further identify a three-stage evolutionary pattern of refusal behavior, providing a conceptual model of how safety features emerge and break under pressure. These results challenge the assumption that scaling guarantees safety and establish refusal robustness scaling as a principled dimension of LLM evaluation.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 21710
Loading