Keywords: Adversarial Robustness, Interpretability, Jailbreaks, Safety
TL;DR: Our work suggests a picture of gradient-based jailbreaks as "bugs" that induce more general "features" in the model: highly out-of-distribution text that induces understandable and potentially controllable changes in language models.
Abstract: Recent work has demonstrated that it is possible to find gradient-based jailbreaks (GBJs) that cause safety-tuned LLMs to output harmful text. Understanding the nature of model computation on these adversaries is essential for better robustness and safety. We find that these GBJs are highly out-of-distribution for the model, even relative to very semantically similar baselines, and that they are easily separated from these baselines with unsupervised clustering alone. Despite this, GBJs have elements of structure. We find that attacks are semantically similar to their induced outputs, and that the distribution of model features shifts in understandable ways. Using our understanding, we are able to steer models without training, both to be more or less susceptible to jailbreaking and to output harmful text in response to harmful prompts without jailbreaking. Our findings suggest a picture of GBJs as "bugs" that induce more general "features" in the model: highly out-of-distribution text that induces understandable and potentially controllable changes in language models.
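The abstract's "steering models without training" is consistent with activation-addition-style interventions; the sketch below illustrates that general idea, not the paper's actual method. The model name, layer index, steering strength, and the random placeholder direction are all assumptions; in practice the direction would be estimated from model activations (e.g., contrasting jailbreak and baseline prompts).

```python
# Minimal sketch of training-free steering by adding a direction to the residual
# stream of a causal LM. Assumptions: MODEL_NAME, LAYER, ALPHA, and the random
# steering_vector are illustrative stand-ins, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # placeholder; the paper's safety-tuned LLM is not specified here
LAYER = 6             # hypothetical layer at which to intervene
ALPHA = 4.0           # hypothetical steering strength

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

hidden = model.config.hidden_size
# Stand-in for a feature direction; here random purely for illustration.
steering_vector = torch.randn(hidden)
steering_vector = steering_vector / steering_vector.norm()

def add_direction(module, inputs, output):
    # Decoder blocks return a tuple whose first element is the residual stream.
    hidden_states = output[0] + ALPHA * steering_vector.to(output[0].dtype)
    return (hidden_states,) + output[1:]

# Attach the hook to one transformer block (GPT-2 layout shown).
handle = model.transformer.h[LAYER].register_forward_hook(add_direction)

prompt = "Tell me about your safety guidelines."
ids = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**ids, max_new_tokens=40, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))

handle.remove()  # remove the hook to restore unsteered behavior
```

Removing the hook restores the original model, so susceptibility to jailbreaking can in principle be dialed up or down at inference time by changing the sign or magnitude of ALPHA.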
Serve As Reviewer: uzgirit@gmail.com
Submission Number: 52