Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use
Keywords: SLMs, Agentic Models, Safety, Alignment
TL;DR: We align agentic language models for safe multi-step tool use by explicitly training them to plan, check, and decide when to act or refuse using trajectory-level preference learning.
Abstract: Agentic language models operate in a distinct safety regime from chat models: they plan, call tools, and execute long-horizon actions where a single error (e.g., file access or credential entry) can cause irreversible harm. Alignment methods optimized for static generation fail in this setting due to sequential decision-making, adversarial tool feedback, and overconfident intermediate reasoning. We introduce MOSAIC, a post-training framework that aligns agents for safe multi-step tool use by making safety decisions explicit and learnable. MOSAIC organizes inference as a plan–check–act/refuse loop with explicit safety reasoning and refusal as first-class actions. Training uses preference-based reinforcement learning over pairwise trajectory comparisons, avoiding trajectory-level labels while capturing safety distinctions missed by scalar rewards. MOSAIC reduces harmful behavior by up to 50%, increases harmful-task refusal by over 20% under injection, cuts privacy leakage, and preserves or improves benign performance, demonstrating robust generalization across models, domains, and agentic settings.
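The two ingredients the abstract names can be sketched concretely. This is a hypothetical minimal illustration, not the authors' implementation: `safety_check`, `run_agent`, and the toy action names are invented for exposition, and the trajectory-level preference loss is shown as a standard Bradley–Terry objective on scalar trajectory scores.

```python
import math

def safety_check(action: str) -> bool:
    """Toy safety check: flag actions that touch credentials or delete files."""
    return not any(k in action for k in ("credential", "delete_file"))

def run_agent(plan: list[str]) -> list[tuple[str, str]]:
    """Plan-check-act/refuse loop: refusal is a first-class action.

    Each planned step is executed only if it passes the safety check;
    an unsafe step is recorded as a refusal and the rollout halts.
    """
    trajectory = []
    for action in plan:
        if safety_check(action):
            trajectory.append(("act", action))
        else:
            trajectory.append(("refuse", action))
            break
    return trajectory

def pairwise_preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Bradley-Terry loss over trajectory-level scores: -log sigmoid(s+ - s-).

    Only the pairwise comparison between two whole trajectories is needed,
    not a per-trajectory scalar label.
    """
    margin = score_preferred - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

traj = run_agent(["read_docs", "enter_credentials", "send_email"])
print(traj)  # the credential step is refused and the rollout stops there
print(pairwise_preference_loss(2.0, 0.5))  # small loss: safe trajectory scored higher
```

The design choice the sketch mirrors is that refusal appears inside the action space (so it can be preferred or dispreferred during training) rather than as an out-of-band filter.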
Submission Number: 80