Active Attacks: Red-teaming LLMs via Adaptive Environments

Taeyoung Yun; Pierre-Luc St-Charles; Jinkyoo Park; Yoshua Bengio; Minsu Kim

18 Sept 2025 (modified: 11 Feb 2026). Submitted to ICLR 2026. License: CC BY 4.0

Keywords: Large Language Models, AI Safety, Red-teaming, Reinforcement Learning

TL;DR: RL-based red-teaming in an evolving environment: periodically safety fine-tuning the victim LLM promotes easy-to-hard exploration.

Abstract: We address the challenge of generating diverse attack prompts for large language models (LLMs) that elicit harmful behaviors (e.g., insults, sexual content) and are used for safety fine-tuning. Rather than relying on manual prompt engineering, attacker LLMs can be trained with reinforcement learning (RL) to automatically generate such prompts using only a toxicity classifier as a reward. However, capturing a wide range of harmful behaviors is a significant challenge that requires explicit diversity objectives. Existing diversity-seeking RL methods often collapse to limited modes: once high-reward prompts are found, exploration of new regions is discouraged. Inspired by the active learning paradigm that encourages adaptive exploration, we introduce Active Attacks, a novel RL-based red-teaming algorithm that adapts its attacks as the victim evolves. By periodically safety fine-tuning the victim LLM with collected attack prompts, rewards in exploited regions diminish, which forces the attacker to seek unexplored vulnerabilities. This process naturally induces an easy-to-hard exploration curriculum, where the attacker progresses beyond easy modes toward increasingly difficult ones. As a result, Active Attacks uncovers a wide range of local attack modes step by step, and their combination achieves wide coverage of the multi-mode distribution. Active Attacks is a simple plug-and-play module that integrates seamlessly into existing RL objectives. It unexpectedly outperformed prior RL-based methods, including GFlowNets, PPO, and REINFORCE, improving the cross-attack success rate against GFlowNets, the previous state of the art, from 0.07% to 31.28% (a relative gain greater than 400×) with only a 6% increase in computation.
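The mechanism described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the attacker here is a greedy argmax stand-in rather than an RL policy, "safety fine-tuning" is modeled simply as zeroing the reward of every attack mode already collected, and the function name `run_red_teaming`, its parameters, and the reward values are all hypothetical. The sketch only shows why a fixed victim lets the attacker collapse onto the easiest mode, while an adapting victim pushes it through the modes from easy to hard.

```python
def run_red_teaming(base_rewards, rounds, adapt_victim):
    """Toy loop: a greedy attacker picks the highest-reward attack mode each round.

    base_rewards: hypothetical toxicity-classifier reward per attack mode
                  against the original victim (higher means an easier mode).
    adapt_victim: if True, the victim is "safety fine-tuned" after each round,
                  modeled here as zeroing the reward of the collected mode.
    Returns the attack modes discovered, in order of discovery.
    """
    rewards = list(base_rewards)  # reward landscape currently seen by the attacker
    discovered = []
    for _ in range(rounds):
        # Greedy stand-in for the attacker: exploit the best available mode.
        mode = max(range(len(rewards)), key=lambda i: rewards[i])
        if rewards[mode] <= 0.0:
            break  # no vulnerable mode is left to exploit
        if mode not in discovered:
            discovered.append(mode)
        if adapt_victim:
            # Assumption: fine-tuning fully removes the exploited vulnerability.
            rewards[mode] = 0.0
    return discovered


base = [0.9, 0.6, 0.3, 0.1]  # made-up rewards, ordered easy to hard
passive = run_red_teaming(base, rounds=4, adapt_victim=False)
active = run_red_teaming(base, rounds=4, adapt_victim=True)
print(passive)  # [0]
print(active)   # [0, 1, 2, 3]
```

With a fixed victim the attacker returns to mode 0 every round. With the adapting victim, each exploited mode stops paying off, so the modes are uncovered in order of decreasing ease. A real system would replace the argmax with a trained attacker LLM and the zeroing step with actual fine-tuning on the collected prompts.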

Primary Area: reinforcement learning

Submission Number: 10617
