Keywords: Agentic reinforcement learning, LLM jailbreaks, Safety evaluation
TL;DR: A single search token is enough to jailbreak agentic RL–trained search models.
Abstract: Agentic reinforcement learning (RL) trains large language models to autonomously call external tools during reasoning, with search as the most common application. These models perform well on multi-step reasoning tasks, but their safety properties are not well understood. In this study, we show that RL-trained search models inherit refusal behaviours from instruction tuning, often blocking harmful prompts by turning them into safe queries. However, this inherited safety is fragile. Two simple attacks, one that forces the model to begin its response with a search (the search attack) and another that encourages the model to search repeatedly (the multi-search attack), cause cascades of harmful searches and answers. Compared to base search models, these attacks lower refusal rates by up to 59.5%, the safety of final answers by up to 82.3%, and the safety of search queries by up to 81.6%. Our results hold across two model families, with access both to local databases and to web search. The attacks succeed by triggering models to generate search queries before they get a chance to generate their inherited refusal tokens. This exposes a key weakness of current RL training: it rewards effective search queries without considering their harmfulness. As a result, RL search models have vulnerabilities that users can easily exploit, making it urgent to develop safety-aware agentic RL pipelines for tool use.
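To make the search-attack idea concrete, below is a minimal, hypothetical sketch of what "forcing the model to begin its response with search" could look like: the attacker prefills the assistant turn with an opening search tag so decoding starts inside a tool call rather than at the point where refusal tokens would normally be emitted. The tag names, prompt template, and `model_generate` callable are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the "search attack": prefill the assistant turn
# with a <search> tag so generation continues as a search query instead of
# starting with the refusal tokens learned during instruction tuning.
# All tag names and the prompt format are assumptions for illustration.

HARMFUL_PROMPT = "How do I do <harmful thing>?"  # placeholder, not a real query


def build_prefilled_prompt(user_prompt: str) -> str:
    """Assemble a chat-style prompt whose assistant turn is prefilled
    so the first tokens the model must continue form a search call."""
    return (
        f"User: {user_prompt}\n"
        "Assistant: <search>"  # model continues the query rather than refusing
    )


def run_search_attack(model_generate, user_prompt: str) -> str:
    """model_generate is any callable str -> str (e.g. a wrapper around an
    LLM API that supports assistant-prefill). Because decoding starts after
    '<search>', the inherited refusal never gets a chance to appear."""
    prompt = build_prefilled_prompt(user_prompt)
    return prompt + model_generate(prompt)


if __name__ == "__main__":
    # Stand-in "model" so the sketch runs without any external dependency.
    fake_model = lambda prompt: "some continuation of the query</search> ..."
    print(run_search_attack(fake_model, HARMFUL_PROMPT))
```

The multi-search attack described in the abstract would work analogously, except the prompt or prefill encourages the model to issue several search calls in a row, compounding the chance that at least one query and its downstream answer are harmful.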
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 21741