Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use
Keywords: SLMs, Agentic Models, Safety, Alignment
TL;DR: We align agentic language models for safe multi-step tool use by explicitly training them to plan, check, and decide when to act or refuse using trajectory-level preference learning.
Abstract: Agentic language models operate in a distinct safety regime from chat models: they plan, call tools, and execute long-horizon actions where a single error (e.g., file access or credential entry) can cause irreversible harm. Alignment methods optimized for static generation fail in this setting due to sequential decision-making, adversarial tool feedback, and overconfident intermediate reasoning. We introduce MOSAIC, a post-training framework that aligns agents for safe multi-step tool use by making safety decisions explicit and learnable. MOSAIC organizes inference as a plan–check–act/refuse loop with explicit safety reasoning and refusal as first-class actions. Training uses preference-based reinforcement learning over pairwise trajectory comparisons, avoiding trajectory-level labels while capturing safety distinctions missed by scalar rewards. MOSAIC reduces harmful behavior by up to 50%, increases harmful-task refusal by over 20% under injection, cuts privacy leakage, and preserves or improves benign performance, demonstrating robust generalization across models, domains, and agentic settings.
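The two ingredients the abstract names can be sketched concretely. This is a hypothetical minimal illustration, not the authors' implementation: `safety_check`, `run_agent`, and the toy action names are invented for exposition, and the trajectory-level preference loss is shown as a standard Bradley–Terry objective on scalar trajectory scores.

```python
import math

def safety_check(action: str) -> bool:
    """Toy safety check: flag actions that touch credentials or delete files."""
    return not any(k in action for k in ("credential", "delete_file"))

def run_agent(plan: list[str]) -> list[tuple[str, str]]:
    """Plan-check-act/refuse loop: refusal is a first-class action.

    Each planned step is executed only if it passes the safety check;
    an unsafe step is recorded as a refusal and the rollout halts.
    """
    trajectory = []
    for action in plan:
        if safety_check(action):
            trajectory.append(("act", action))
        else:
            trajectory.append(("refuse", action))
            break
    return trajectory

def pairwise_preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Bradley-Terry loss over trajectory-level scores: -log sigmoid(s+ - s-).

    Only the pairwise comparison between two whole trajectories is needed,
    not a per-trajectory scalar label.
    """
    margin = score_preferred - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

traj = run_agent(["read_docs", "enter_credentials", "send_email"])
print(traj)  # the credential step is refused and the rollout stops there
print(pairwise_preference_loss(2.0, 0.5))  # small loss: safe trajectory scored higher
```

The design choice the sketch mirrors is that refusal appears inside the action space (so it can be preferred or dispreferred during training) rather than as an out-of-band filter.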
Submission Number: 80