Between the Bars: Gradient-based Jailbreaks are Bugs that induce Features

NeurIPS 2024 Workshop ATTRIB Submission 97 Authors

Published: 30 Oct 2024, Last Modified: 14 Jan 2025 · ATTRIB 2024 · CC BY 4.0
Keywords: Safety, Robustness, Interpretability, Attribution
Abstract:

Recent work has demonstrated that it is possible to find gradient-based jailbreaks (GBJs) that cause safety-tuned LLMs to output harmful text. Understanding the nature of these adversaries might yield valuable insights for improving robustness and safety. We find that, even relative to semantically very similar baselines, GBJs are highly out-of-distribution for the model. Despite this, GBJs induce a structured change in models: the activations of jailbreak and baseline prompts are separable with unsupervised methods alone. Using this understanding, we are able to steer models without training, both to be more or less susceptible to jailbreaking and to output harmful text in response to harmful prompts without any jailbreak. Our findings suggest a picture of GBJs as "bugs" that induce more general "features" in the model: highly out-of-distribution text that induces understandable and potentially controllable changes in language models.
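The abstract's training-free steering can be illustrated with difference-of-means activation steering: compute the mean residual-stream activation over jailbreak prompts and over baseline prompts, and add (or subtract) their difference at a chosen layer during generation. The sketch below is only a minimal, hedged illustration under assumed details; the model name (`gpt2` as a stand-in), layer index, steering scale, prompt placeholders, and the hook-based mechanism are illustrative choices, not the paper's actual setup.

```python
# Minimal sketch of difference-of-means activation steering (illustrative only).
# Model, layer, scale, and prompt sets are assumptions, not the paper's configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the paper studies safety-tuned LLMs
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

LAYER = 6  # illustrative transformer block to read from and steer


def mean_resid(prompts):
    """Mean last-token activation at the output of block LAYER over a prompt set."""
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        # hidden_states[0] is the embedding output, so block LAYER's output is index LAYER + 1
        acts.append(out.hidden_states[LAYER + 1][0, -1])
    return torch.stack(acts).mean(dim=0)


# Placeholder prompt sets; in practice these would be jailbreak vs. baseline prompts.
jailbreak_prompts = ["<harmful prompt + adversarial suffix>"]
baseline_prompts = ["<harmful prompt alone>"]

# Steering direction: difference of mean activations between the two sets.
steer = mean_resid(jailbreak_prompts) - mean_resid(baseline_prompts)


def add_steering(scale=4.0):
    """Register a forward hook that shifts block LAYER's output along the steering direction."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * steer
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return model.transformer.h[LAYER].register_forward_hook(hook)


# Generate with steering applied; negate the scale to steer toward refusal instead.
handle = add_steering(scale=4.0)
ids = tok("<harmful prompt alone>", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=40)[0]))
handle.remove()  # restore unsteered behavior
```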

Submission Number: 97
