Keywords: Robustness, Scale, Language Model, LLM, Adversarial Training, Transfer
TL;DR: Bigger models aren't necessarily more robust out-of-the-box, but they learn faster and better from adversarial training.
Abstract: Language model capabilities improve predictably as model size and training data are scaled.
Motivated by this, increasingly large language models have been trained, yielding an array of impressive capabilities.
Yet these models suffer from adversarial prompts such as "jailbreaks" that hijack models into performing undesired behaviors, posing a significant risk of misuse.
Prior work has found that computer vision models become more robust with model and data scaling, raising the question: does language model robustness also improve with scale?
We study this question empirically, finding that larger models respond substantially more effectively to adversarial training, but that model scale provides little to no robustness benefit in the absence of defenses.
Submission Number: 42