Keywords: Agent Safety, Reinforcement Learning, Guardrails, RLHF
TL;DR: Reinforcement Learning Guard transforms safety from a reactive stopgap into a proactive co-pilot, preventing cascading failures while keeping LLM agents both capable and trustworthy.
Abstract: Large language model (LLM) agents augmented with external tools are rapidly becoming integral to both everyday assistance and high-stakes decision-making. Yet recent studies reveal a critical vulnerability: \textit{cascading failures} in multi-step tasks. A single minor error—such as misinterpreting an ambiguous name—can propagate, amplify, and ultimately derail the entire workflow. Existing safeguards act as emergency brakes: they can prevent catastrophic mistakes, but only by halting progress entirely, leaving users stranded.
In this paper, we introduce \textbf{Reinforcement Learning Guard (RL-GUARD)}, a proactive safeguard framework that functions as a co-pilot rather than a stop button. RL-GUARD combines: (i) a \textbf{critic} that monitors trajectories and adaptively enables safety reflection, (ii) an \textbf{actor} that selects safe corrective actions from reflection-triggered candidates, and (iii) a \textbf{risk-conditioned safety reward model} that delivers precise step-level feedback during RL training. To enable robust learning, we release the first large-scale dataset for safe agent training, featuring step-level human annotations and realistic evaluation simulators.
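To make the three-component design concrete, the minimal Python sketch below illustrates how a critic, an actor, and a risk-conditioned safety reward model could interact around a single agent step. All names (TrajectoryCritic, CorrectiveActor, SafetyRewardModel, the risk threshold, and the reward shaping) are hypothetical illustrations under assumed interfaces, not the released implementation.

```python
# Hypothetical sketch of the critic / actor / safety-reward interplay described above.
# Names, signatures, thresholds, and reward shaping are illustrative assumptions,
# not taken from the paper's released code.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Step:
    action: str
    observation: str
    risk_score: float  # assumed step-level risk estimate in [0, 1]

@dataclass
class Trajectory:
    steps: List[Step] = field(default_factory=list)

class TrajectoryCritic:
    """Monitors the trajectory and decides when to trigger safety reflection."""
    def __init__(self, risk_threshold: float = 0.5):
        self.risk_threshold = risk_threshold

    def should_reflect(self, traj: Trajectory) -> bool:
        # Trigger reflection when the latest step looks risky.
        return bool(traj.steps) and traj.steps[-1].risk_score > self.risk_threshold

class CorrectiveActor:
    """Selects a safe corrective action from reflection-triggered candidates."""
    def select(self, candidates: List[Step]) -> Step:
        return min(candidates, key=lambda s: s.risk_score)

class SafetyRewardModel:
    """Risk-conditioned step-level feedback for RL training (toy shaping)."""
    def step_reward(self, step: Step) -> float:
        return 1.0 - 2.0 * step.risk_score

def guarded_step(traj: Trajectory, proposed: Step, candidates: List[Step],
                 critic: TrajectoryCritic, actor: CorrectiveActor) -> Step:
    """Apply the guard to one agent step: reflect and correct if the critic fires."""
    traj.steps.append(proposed)
    if critic.should_reflect(traj):
        traj.steps[-1] = actor.select(candidates)
    return traj.steps[-1]
```

At training time, the step-level rewards from such a reward model would feed a standard RL objective; the threshold and linear reward shaping above are placeholders, not the paper's formulation.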
Experiments demonstrate that RL-GUARD consistently outperforms state-of-the-art (SOTA) baselines, reducing risk to \textbf{6\% on ToolEmu} and \textbf{14\% on AgentHarm} while preserving task effectiveness. Moreover, RL-GUARD incurs only moderate overhead (29\% on GPT-4o for ToolEmu), 52\% lower than the SOTA baseline. Our results highlight RL-GUARD as a paradigm shift: from reactive stopgaps to proactive, safety-aware co-pilots for LLM agents.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 3108