Diverse and Efficient Red-Teaming for LLM Agents via Distilled Structured Reasoning

Published: 27 Oct 2025 · Last Modified: 27 Oct 2025 · NeurIPS Lock-LLM Workshop 2025 Poster · CC BY 4.0
Keywords: LLM Agents, Red-teaming, Reasoning, Safety Evaluation
TL;DR: We present a black-box red-teaming framework that generates diverse seed tests and iteratively crafts adversarial attacks using a compact red-teamer trained via distilled structured reasoning.
Abstract: The ability of LLM agents to plan and invoke tools exposes them to new safety risks, making a comprehensive red-teaming system crucial for discovering vulnerabilities and ensuring their safe deployment. We present a generic red-teaming framework for arbitrary black-box LLM agents. We employ a dynamic two-step process that starts with an agent definition and generates seed test cases covering diverse risk outcomes, tool-use trajectories, and risk sources. It then iteratively constructs and refines model-based adversarial attacks based on the execution trajectories of previous attempts. To reduce red-teaming cost, we present a model distillation approach that leverages structured forms of a teacher model’s reasoning to train smaller models that are equally effective. Across diverse agent evaluation settings, our seed test case generation approach yields a 2–2.5x boost in the coverage of risk outcomes and tool-calling trajectories. Our distilled 8B red-teamer model improves attack success rate by 75%, achieving a rate comparable to the 671B DeepSeek-R1 model. Our analyses confirm the effectiveness of the iterative framework and structured reasoning, as well as the generalization of our red-teamer models.
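To make the two-step process concrete, the following is a minimal sketch of the loop the abstract describes: seed generation from an agent definition, followed by iterative attack refinement driven by execution trajectories. All names here (`generate_seed_cases`, `refine`, `agent.run`, `judge.is_unsafe`) are hypothetical placeholders for illustration, not the paper's actual API.

```python
# Hedged sketch of the black-box red-teaming loop from the abstract.
# All helper names are illustrative assumptions, not the paper's code.

def red_team(agent, red_teamer, judge, agent_definition, max_iters=5):
    """Black-box red-teaming: seed generation, then iterative refinement."""
    findings = []
    # Step 1: generate seed test cases that cover diverse risk outcomes,
    # tool-use trajectories, and risk sources.
    seeds = red_teamer.generate_seed_cases(agent_definition)
    for seed in seeds:
        attack = seed
        for _ in range(max_iters):
            # Execute the attack against the black-box agent and record
            # its full execution trajectory (plan, tool calls, outputs).
            trajectory = agent.run(attack)
            if judge.is_unsafe(trajectory):
                # Attack succeeded: record the vulnerability and move on.
                findings.append((attack, trajectory))
                break
            # Step 2: refine the attack using the trajectory of the
            # previous attempt (here, with the compact distilled model).
            attack = red_teamer.refine(attack, trajectory)
    return findings
```

The key design point the abstract emphasizes is that the refinement step conditions on execution trajectories rather than only on the agent's final response, and that the red-teamer driving both steps can be a small distilled model rather than a large teacher.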
Submission Number: 65