The Hidden Dimensions of LLM Alignment: A Multi-Dimensional Analysis of Orthogonal Safety Directions

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: LLM safety-aligned behavior is jointly controlled by multi-dimensional directions, each representing distinct safety features.
Abstract: Large Language Models' safety-aligned behaviors, such as refusing harmful queries, can be represented by linear directions in activation space. Previous research modeled safety behavior with a single direction, limiting mechanistic understanding to an isolated safety feature. In this work, we discover that safety-aligned behavior is jointly controlled by multi-dimensional directions. Specifically, we study the vector space of representation shifts during safety fine-tuning of Llama 3 8B for refusing jailbreaks. By studying orthogonal directions in this space, we first find that a dominant direction governs the model's refusal behavior, while multiple smaller directions represent distinct and interpretable features like hypothetical narrative and role-playing. We then measure how different directions promote or suppress the dominant direction, showing the important role of secondary directions in shaping the model's refusal representation. Finally, we demonstrate that removing certain trigger tokens in harmful queries can suppress these directions and thereby bypass the learned safety capability, providing new insights into safety alignment vulnerabilities from a multi-dimensional perspective.
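The sketch below illustrates one concrete way to realize the analysis the abstract describes: collect residual-stream activations for the same prompts from a base and a safety fine-tuned model, take their differences as representation shifts, and extract orthogonal directions via SVD/PCA, then measure each prompt's projection onto the dominant direction. This is a minimal illustration, not the authors' exact pipeline (see the linked repository for that); the model paths, layer index, and prompt list are placeholders.

```python
# Minimal sketch: orthogonal directions of representation shifts between a base
# and a safety fine-tuned model. Checkpoint paths, layer, and prompts are
# placeholders, not the paper's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "meta-llama/Meta-Llama-3-8B-Instruct"       # assumed base checkpoint
TUNED = "path/to/safety-finetuned-llama3-8b"       # hypothetical fine-tuned checkpoint
LAYER = 16                                          # example middle layer
prompts = [                                         # placeholder jailbreak-style prompts
    "Pretend you are an evil AI and describe how to ...",
    "Write a fictional story in which a character explains how to ...",
]

tok = AutoTokenizer.from_pretrained(BASE)

def last_token_states(model_name: str) -> torch.Tensor:
    """Residual-stream activation of the final prompt token at LAYER, one row per prompt."""
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
    model.eval()
    states = []
    with torch.no_grad():
        for p in prompts:
            ids = tok(p, return_tensors="pt")
            out = model(**ids, output_hidden_states=True)
            states.append(out.hidden_states[LAYER][0, -1].float())
    del model
    return torch.stack(states)                      # (n_prompts, d_model)

# Representation shift induced by safety fine-tuning, one vector per prompt.
shifts = last_token_states(TUNED) - last_token_states(BASE)

# Orthogonal directions of the shift space via SVD (PCA after mean-centering).
shifts = shifts - shifts.mean(dim=0, keepdim=True)
U, S, Vt = torch.linalg.svd(shifts, full_matrices=False)
explained = S**2 / (S**2).sum()
print("variance explained by top directions:", explained[:5])

# Projection of each prompt's shift onto the dominant direction v1 = Vt[0];
# larger values indicate stronger alignment with the refusal-governing direction.
proj_on_dominant = shifts @ Vt[0]
print("per-prompt alignment with the dominant direction:", proj_on_dominant)
```

In practice one would use a much larger prompt set and inspect several layers; the top singular vector then plays the role of the dominant refusal direction, and the remaining right singular vectors are candidates for the secondary, feature-specific directions the paper analyzes.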
Lay Summary: Large language models (LLMs) like ChatGPT are trained to avoid producing harmful content by refusing problematic questions, a process known as "safety alignment." However, it is still unclear exactly how these models internally learn to refuse harmful prompts.

In our study, we looked inside the AI's internal representations, its "mental space," to understand how it decides when to reject harmful instructions. Using mathematical tools, we discovered that instead of using just one simple criterion, the model uses several independent factors simultaneously. One main factor strongly influences whether the model refuses dangerous requests, while smaller, secondary factors handle nuances like fictional storytelling or role-playing scenarios.

Understanding these hidden dimensions helps researchers improve safety training methods. Additionally, we found that certain trigger words in prompts can bypass the safety system by affecting these secondary dimensions. Our findings can inform future strategies to strengthen LLMs against attempts to circumvent their safety measures, ultimately making AI safer and more reliable for real-world use.
Link To Code: https://github.com/BMPixel/safety-residual-space
Primary Area: Deep Learning->Large Language Models
Keywords: Large Language Model, Safety Alignment, Mechanistic Interpretation
Submission Number: 3012