Learning Safety Constraints for Large Language Models

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 Spotlight Poster · CC BY 4.0
Abstract: Large language models (LLMs) have emerged as powerful tools but pose significant safety risks through harmful outputs and vulnerability to adversarial attacks. We propose SaP (short for Safety Polytope), a geometric approach to LLM safety that learns and enforces multiple safety constraints directly in the model's representation space. We develop a framework that identifies safe and unsafe regions via the polytope's facets, enabling both detection and correction of unsafe outputs through geometric steering. Unlike existing approaches that modify model weights, SaP operates post-hoc in the representation space, preserving model capabilities while enforcing safety constraints. Experiments across multiple LLMs demonstrate that our method effectively detects unethical inputs and reduces adversarial attack success rates while maintaining performance on standard tasks, highlighting the importance of an explicit geometric model for safety. Analysis of the learned polytope facets reveals the emergence of specialization in detecting different semantic notions of safety, providing interpretable insights into how safety is captured in LLMs' representation space.
Lay Summary: Large language models (LLMs) are powerful tools, but their propensity for generating harmful content and susceptibility to adversarial attacks raise significant safety concerns. We introduce Safety Polytope (SaP), a novel geometric framework designed to enhance LLM safety. SaP learns and enforces safety constraints directly within the model's inner workings, specifically in its representation space. This approach defines a "safe zone" using a geometric structure called a polytope. By identifying safe and unsafe regions within this space, SaP can detect and correct potentially harmful outputs. When the model is about to generate unsafe content, SaP "steers" its behavior back towards the safe zone. Experiments demonstrate that SaP effectively detects unethical inputs and reduces the success rate of adversarial attacks, all while maintaining the model's performance on standard tasks. This work highlights the benefit of employing an explicit geometric model to address safety in LLMs.
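For intuition, here is a minimal sketch of the core geometric idea described above: hidden states are checked against a set of learned linear facets, and violated facets trigger a corrective steering step. This is not the authors' released implementation (see the repository link below); the class name, the facet parameters `W` and `b`, and the single-step steering rule are illustrative assumptions.

```python
# Illustrative sketch only (assumed interface, not the authors' code): a polytope
# {h : W h <= b} over hidden representations, with a simple per-facet steering step.
# W and b would be learned from labeled safe/unsafe representations.
import torch


class SafetyPolytopeSketch:
    def __init__(self, W: torch.Tensor, b: torch.Tensor):
        # W: (num_facets, hidden_dim) facet normals; b: (num_facets,) offsets.
        self.W, self.b = W, b

    def violations(self, h: torch.Tensor) -> torch.Tensor:
        # Positive entries mark facets whose half-space constraint w·h <= b is violated.
        return h @ self.W.T - self.b

    def is_safe(self, h: torch.Tensor, tol: float = 0.0) -> bool:
        # Detection: the representation is flagged safe only if it lies inside the polytope.
        return bool((self.violations(h) <= tol).all())

    def steer(self, h: torch.Tensor, step: float = 1.0) -> torch.Tensor:
        # Correction: for each violated facet, step against its normal by the amount
        # needed to reach the facet's hyperplane (a projection-style update).
        v = self.violations(h).clamp(min=0.0)                       # (num_facets,)
        correction = (v / self.W.norm(dim=-1).pow(2)).unsqueeze(-1) * self.W
        return h - step * correction.sum(dim=0)
```

As a usage sketch: given a hidden state `h` extracted from the model, one would call `is_safe(h)` to detect unsafe inputs and `steer(h)` to nudge the representation back toward the safe region before continuing generation; the actual facet learning and intervention points are detailed in the paper and repository.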
Link To Code: https://github.com/lasgroup/SafetyPolytope
Primary Area: Deep Learning->Large Language Models
Keywords: Constraint learning; LLM Safety/Alignment
Submission Number: 5114