Keywords: ai safety, language models, adversarial attacks, robustness, scaling laws
Abstract: Language models exhibit scaling laws, whereby increasing model and dataset size yields predictable decreases in negative log likelihood, unlocking a dazzling array of capabilities. At the same time, even the most capable systems are currently vulnerable to adversarial inputs such as jailbreaks and prompt injections, despite concerted efforts to make them robust. As compute becomes more accessible to both attackers and defenders, which side will benefit more from scale? Will safety-trained frontier models become robust to all but the strongest attacks, or will additional compute make attacks almost impossible to defend against?
We attempt to answer this question with a detailed study of robustness on language models spanning three orders of magnitude in parameter count. We find that increasing base model size alone does not consistently improve robustness. However, larger models benefit more from safety training and, in particular, generalize better from adversarial training to new attacks. We then study the attacker's perspective, finding that attack success rates improve predictably as attacker compute increases against all models studied. Finally, we show that offense widens its advantage as both sides spend more on compute.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8765