Keywords: Refusal Direction, LLM Jailbreak
TL;DR: We identify a universal feature space for jailbreak detection across LLMs via model stitching, and deduce that a single refusal direction can reliably detect transferable attacks.
Abstract: The refusal directions of large language models (LLMs), i.e., the model's internal vectors governing acceptance or refusal of prompts, are central to jailbreak and safety research. However, prior studies are limited to examining refusal directions within the embedding space of a single model's internal representations, thereby overlooking universal and transferable jailbreak features across diverse models. In this work, we characterise universal jailbreak features of LLMs by defining a feature space theoretically motivated by model stitching and deducing a universal refusal direction across LLMs. We instantiate this framework with a universal feature space that supports jailbreak prompt detection in both in-distribution and out-of-distribution settings. Within this feature space, we identify universal jailbreak features through layer-wise representation propagation via multilayer perceptrons, revealing substantial shared structure in refusal behaviour across models. We then derive a universal refusal direction across LLMs by averaging per-LLM refusal vectors, yielding a one-dimensional representation that enables transferable jailbreak detection via linear projection. In experiments, the universal feature space improves jailbreak detection by about 10% over prior baselines, and the universal refusal direction achieves a similar gain for transferable attack detection, with both methods extending effectively to black-box models. Our findings directly demonstrate that universal and transferable jailbreak features can be explicitly modelled, offering novel insight into the shared linear structure of refusal directions across LLMs.
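To make the averaging-and-projection idea concrete, below is a minimal sketch of how a universal refusal direction could be formed and used for detection. The difference-of-means estimator, the function names, and the assumption that per-model refusal vectors have already been mapped into a shared stitched space of equal dimension are illustrative choices, not the paper's exact implementation.

```python
import numpy as np

def refusal_direction(harmful_acts: np.ndarray, harmless_acts: np.ndarray) -> np.ndarray:
    """Per-model refusal vector via difference of mean activations
    (a common construction; the paper's estimator may differ)."""
    d = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def universal_refusal_direction(per_model_dirs: list[np.ndarray]) -> np.ndarray:
    """Average per-LLM refusal vectors, assumed to already live in a shared
    stitched feature space, into a single one-dimensional direction."""
    u = np.mean(np.stack(per_model_dirs), axis=0)
    return u / np.linalg.norm(u)

def jailbreak_score(prompt_act: np.ndarray, u: np.ndarray) -> float:
    """Linear projection onto the universal refusal direction; the projection
    magnitude is then thresholded to flag transferable jailbreak prompts."""
    return float(prompt_act @ u)
```

In this sketch, detection reduces to a single dot product per prompt, which is what makes the one-dimensional representation attractive for transfer to models (including black-box ones) whose internals were not used to fit the direction.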
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 10161