AgentAlign: Navigating Safety Alignment in the Shift from Informative to Agentic Large Language Models

ICLR 2026 Conference Submission 5011 Authors

14 Sept 2025 (modified: 28 Nov 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: language agents, safety alignment
Abstract: The emergence of agentic capabilities in large language models fundamentally transforms their risk profile from passive information providers to autonomous action executors, introducing safety challenges that existing alignment methods fail to address. Current approaches lack systematic frameworks for understanding and modeling the behavioral patterns underlying malicious agentic activities, leading to brittle safety measures that collapse when confronted with multi-step harmful requests. We introduce AgentAlign, a behavioral modeling framework for agentic alignment that systematically captures malicious activity patterns through abstract behavior chains: structured representations of action sequences that characterize how harmful objectives are pursued across diverse tool-use scenarios. By instantiating these behavioral abstractions within comprehensive simulated environments, our framework enables scalable generation of authentic, executable training scenarios that preserve complex multi-step dynamics while avoiding real-world risks. Extensive evaluation across three model families demonstrates substantial safety improvements (35.8% to 79.5% improvement in refusal rates) while maintaining or improving utility on benign tasks, significantly outperforming existing prompting-based defenses.
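The abstract describes abstract behavior chains only at a high level. As a reading aid, the following is a minimal Python sketch of one way such a chain could be represented and then instantiated against simulated tools; the class names (`BehaviorStep`, `BehaviorChain`), the `instantiate` method, and the example scenario are illustrative assumptions, not the authors' released schema.

```python
# Illustrative sketch only: shows one plausible encoding of an abstract
# behavior chain and how it might be grounded in concrete simulated tools.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class BehaviorStep:
    """One abstract action within a multi-step behavior pattern."""
    capability: str   # abstract capability, e.g. "read_contacts" or "send_email"
    intent: str       # what this step contributes toward the overall objective
    tool_bindings: List[str] = field(default_factory=list)  # candidate concrete tools


@dataclass
class BehaviorChain:
    """An ordered abstraction of how an objective is pursued across tool calls."""
    objective: str
    harmful: bool
    steps: List[BehaviorStep]

    def instantiate(self, tool_map: Dict[str, str]) -> List[str]:
        """Ground each abstract capability in a concrete simulated-environment tool."""
        return [tool_map.get(step.capability, f"<unbound:{step.capability}>")
                for step in self.steps]


# Example: a multi-step harmful pattern in which each step looks innocuous
# in isolation, which is why single-turn filters tend to miss it.
chain = BehaviorChain(
    objective="exfiltrate a user's private contacts",
    harmful=True,
    steps=[
        BehaviorStep("read_contacts", "collect the target data"),
        BehaviorStep("compose_email", "package the data for transfer"),
        BehaviorStep("send_email", "deliver it to an external address"),
    ],
)
print(chain.instantiate({
    "read_contacts": "contacts_api.list",
    "compose_email": "mail.draft",
    "send_email": "mail.send",
}))
```

Under this reading, scalable scenario generation amounts to enumerating chains like the one above and binding their abstract steps to whatever tools a given simulated environment exposes.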
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 5011