There's Levels to It: Red Teaming Dialogue With Hierarchical Reinforcement Learning

ACL ARR 2026 January Submission 9757 Authors

Submitted: 06 Jan 2026 (modified: 20 Mar 2026)
License: CC BY 4.0
Keywords: Reinforcement Learning, LLM Red Teaming
Abstract: Red teaming is essential for securing Large Language Models (LLMs), yet current automated methods remain limited to brittle templates and single-turn attacks, failing to simulate the complex, interactive nature of real-world adversarial behavior. We introduce a red teaming paradigm designed to maximize expected cumulative harm through strategic multi-turn interaction. By formalizing red teaming as a Markov Decision Process (MDP) within a hierarchical Reinforcement Learning (RL) framework, we address the challenges of sparse rewards and long-horizon planning. Our generative agent learns diverse, multi-turn attacks using a token-level harm reward, consistently uncovering vulnerabilities that existing baselines miss. This approach establishes a new state of the art, reframing LLM red teaming as a principled, trajectory-based optimization process.
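
A minimal sketch of the objective implied by the abstract, with all notation assumed rather than taken from the paper: treating the dialogue as an MDP whose state $s_t$ is the conversation history, whose action $a_t$ is the attacker's next utterance, and whose reward $r(s_t, a_t)$ scores the harm elicited at turn $t$, the agent maximizes the expected discounted return over whole trajectories,

\[
J(\pi) = \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^{T} \gamma^{t}\, r(s_t, a_t)\right],
\qquad
\pi(a_t \mid s_t) = \sum_{g} \pi_{\text{high}}(g \mid s_t)\, \pi_{\text{low}}(a_t \mid s_t, g),
\]

where the second identity is one plausible reading of the hierarchical decomposition: a high-level policy $\pi_{\text{high}}$ picks an attack strategy $g$ for the turn and a low-level policy $\pi_{\text{low}}$ generates the utterance conditioned on it. The symbols $\gamma$, $g$, $\pi_{\text{high}}$, and $\pi_{\text{low}}$ are illustrative assumptions, not notation confirmed by the submission.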
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: Autonomous agents, bias/toxicity, dialogue, automatic evaluation, red teaming, reinforcement learning
Contribution Types: Publicly available software and/or pre-trained models, Theory
Languages Studied: English
Submission Number: 9757