Keywords: Agent Safety, Reinforcement Learning, Guardrails, RLHF
TL;DR: Reinforcement Learning Guard transforms safety from a reactive stopgap into a proactive co-pilot, preventing cascading failures while keeping LLM agents both capable and trustworthy.
Abstract: Large language model (LLM) agents augmented with external tools are rapidly becoming integral to both everyday assistance and high-stakes decision-making. Yet recent studies reveal a critical vulnerability: \textit{cascading failures} in multi-step tasks. A single minor error—such as misinterpreting an ambiguous name—can propagate, amplify, and ultimately derail the entire workflow. Existing safeguards act as emergency brakes: they can prevent catastrophic mistakes, but only by halting progress entirely, leaving users stranded.
In this paper, we introduce \textbf{Reinforcement Learning Guard (RL-GUARD)}, a proactive safeguard framework that functions as a co-pilot rather than a stop button. RL-GUARD combines: (i) a \textbf{critic} that monitors trajectories and adaptively enables safety reflection, (ii) an \textbf{actor} that selects safe corrective actions from reflection-triggered candidates, and (iii) a \textbf{risk-conditioned safety reward model} that delivers precise step-level feedback during RL training. To enable robust learning, we release the first large-scale dataset for safe agent training, featuring step-level human annotations and realistic evaluation simulators.
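To make the three-component design concrete, the minimal Python sketch below illustrates how a critic, an actor, and a risk-conditioned safety reward model could interact around a single agent step. All names (TrajectoryCritic, CorrectiveActor, SafetyRewardModel, the risk threshold, and the reward shaping) are hypothetical illustrations under assumed interfaces, not the released implementation.

```python
# Hypothetical sketch of the critic / actor / safety-reward interplay described above.
# Names, signatures, thresholds, and reward shaping are illustrative assumptions,
# not taken from the paper's released code.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Step:
    action: str
    observation: str
    risk_score: float  # assumed step-level risk estimate in [0, 1]

@dataclass
class Trajectory:
    steps: List[Step] = field(default_factory=list)

class TrajectoryCritic:
    """Monitors the trajectory and decides when to trigger safety reflection."""
    def __init__(self, risk_threshold: float = 0.5):
        self.risk_threshold = risk_threshold

    def should_reflect(self, traj: Trajectory) -> bool:
        # Trigger reflection when the latest step looks risky.
        return bool(traj.steps) and traj.steps[-1].risk_score > self.risk_threshold

class CorrectiveActor:
    """Selects a safe corrective action from reflection-triggered candidates."""
    def select(self, candidates: List[Step]) -> Step:
        return min(candidates, key=lambda s: s.risk_score)

class SafetyRewardModel:
    """Risk-conditioned step-level feedback for RL training (toy shaping)."""
    def step_reward(self, step: Step) -> float:
        return 1.0 - 2.0 * step.risk_score

def guarded_step(traj: Trajectory, proposed: Step, candidates: List[Step],
                 critic: TrajectoryCritic, actor: CorrectiveActor) -> Step:
    """Apply the guard to one agent step: reflect and correct if the critic fires."""
    traj.steps.append(proposed)
    if critic.should_reflect(traj):
        traj.steps[-1] = actor.select(candidates)
    return traj.steps[-1]
```

At training time, the step-level rewards from such a reward model would feed a standard RL objective; the threshold and linear reward shaping above are placeholders, not the paper's formulation.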
Experiments demonstrate that RL-GUARD consistently outperforms state-of-the-art (SOTA) baselines, reducing risk to \textbf{6\% on ToolEmu} and \textbf{14\% on AgentHarm} while preserving task effectiveness. Moreover, RL-GUARD incurs only moderate overhead (29\% on GPT-4o for ToolEmu), 52\% lower than the SOTA baseline. Our results highlight RL-GUARD as a paradigm shift: from reactive stopgaps to proactive, safety-aware co-pilots for LLM agents.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 3108