The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We show that refusal in LLMs is mediated by polyhedral concept cones and analyze the interactions of their basis vectors through a novel notion of representational independence.
Abstract: The safety alignment of large language models (LLMs) can be circumvented through adversarially crafted inputs, yet the mechanisms by which these attacks bypass safety barriers remain poorly understood. Prior work suggests that a *single* refusal direction in the model's activation space determines whether an LLM refuses a request. In this study, we propose a novel gradient-based approach to representation engineering and use it to identify refusal directions. Contrary to prior work, we uncover multiple independent directions and even multi-dimensional *concept cones* that mediate refusal. Moreover, we show that orthogonality alone does not imply independence under intervention, motivating the notion of *representational independence*, which accounts for both linear and non-linear effects. Using this framework, we identify mechanistically independent refusal directions, showing that refusal in LLMs is governed by complex spatial structures and driven by multiple functionally distinct mechanisms. Our gradient-based approach uncovers these mechanisms and can serve as a foundation for future work on understanding LLMs.
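For intuition, here is a minimal sketch of how a gradient-based search for a refusal direction and cone-based steering could look. This is not the paper's actual implementation: `score_fn`, `find_refusal_direction`, `steer_into_cone`, and all hyperparameters are illustrative assumptions.

```python
import torch

def find_refusal_direction(score_fn, d_model, steps=200, lr=1e-2, alpha=8.0):
    """Hedged sketch of a gradient-based search for a refusal direction.
    `score_fn` is an assumed differentiable scalar measuring refusal
    strength when the residual stream is steered by its argument
    (e.g., the log-probability of a refusal token under the steered
    model). The paper's exact objective is not reproduced here."""
    r = torch.randn(d_model, requires_grad=True)
    opt = torch.optim.Adam([r], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = -score_fn(alpha * r / r.norm())  # maximize refusal score
        loss.backward()
        opt.step()
    return (r / r.norm()).detach()

def steer_into_cone(h, basis, weights, alpha=8.0):
    """Steer activation `h` (d_model,) along a point inside a polyhedral
    cone spanned by the rows of `basis` (k, d_model): any conic
    (non-negative) combination of the basis directions is expected to
    mediate refusal as well."""
    v = weights.clamp(min=0.0) @ basis          # stay inside the cone
    return h + alpha * v / v.norm()
```

Repeating the search with orthogonality or independence constraints between runs would yield multiple candidate directions whose non-negative span forms the cone.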
Lay Summary: Large language models (LLMs) are designed with safety measures, but specially crafted inputs can bypass these safeguards. The precise ways these "attacks" succeed are not well understood. Previous research suggested that an LLM's decision to refuse a problematic request might be controlled by a single, specific signal or "direction" within its internal representational space. This study introduces a novel method for investigating these internal decision-making processes. Contrary to earlier findings, our research reveals that LLM refusal is not governed by a single factor. Instead, we identified multiple distinct, independent directions and even more complex multi-dimensional structures that trigger refusal. Importantly, we also demonstrate that even when these internal signals are mathematically distinct (orthogonal), they need not operate independently when the model processes information. We therefore propose the concept of "representational independence," which considers both linear and non-linear interactions, to more accurately identify truly distinct mechanisms. Using this framework, we successfully identified mechanistically independent refusal pathways. This confirms that LLM refusal behavior is driven by a complex interplay of multiple, functionally distinct internal mechanisms rather than by a single monolithic one. Our approach not only uncovers these complex structures but also provides a valuable tool for future research aimed at a deeper understanding of LLM operations.
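As a toy illustration (not the paper's actual model or metric), the self-contained sketch below shows how two perfectly orthogonal directions can still fail to act independently once a non-linearity couples them downstream; the multiplicative `refusal_score` and all names are invented for this example.

```python
import torch

torch.manual_seed(0)
d = 16
u = torch.zeros(d); u[0] = 1.0              # two orthogonal directions:
v = torch.zeros(d); v[1] = 1.0              # u @ v == 0

def refusal_score(h):
    """Toy nonlinear downstream computation: later layers can couple
    orthogonal directions, here via a multiplicative interaction."""
    return torch.relu(h @ u) * torch.relu(h @ v)

def ablate(h, r):
    """Project out the component of h along unit vector r."""
    return h - (h @ r) * r

h = torch.randn(d).abs()                    # positive u and v components

# Causal effect of ablating v, with u intact vs. with u already ablated.
effect_of_v = refusal_score(h) - refusal_score(ablate(h, v))
effect_of_v_without_u = (refusal_score(ablate(h, u))
                         - refusal_score(ablate(ablate(h, u), v)))

# The first effect is positive, the second is zero: the directions are
# orthogonal, yet not independent under intervention.
print(effect_of_v, effect_of_v_without_u)
```

In this toy setting, removing `u` silences the effect of `v` entirely, which is exactly the kind of interaction that an orthogonality check alone would miss and that representational independence is meant to capture.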
Link To Code: https://www.cs.cit.tum.de/daml/geometry-of-refusal/
Primary Area: Deep Learning->Large Language Models
Keywords: Refusal directions, Large Language Model, Activation Steering, Interpretability, Representation Engineering, Probing, Causality, Interventions
Submission Number: 11772