RedTopic: Toward Topic-Diverse Red Teaming of Large Language Models

ICLR 2026 Conference Submission 19625 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Trustworthy AI, Red Teaming, Topic-Diversity, Large Language Model, Reinforcement Fine-Tuning
Abstract: As Large Language Models (LLMs) are increasingly deployed as black-box components in real-world applications, $\textbf{red teaming}$, which probes LLMs with curated adversarial prompts to identify vulnerabilities and guide subsequent safety alignment, has become critical. Effective red teaming should adapt to evolving LLM capabilities and cover a broad spectrum of harmful topics. However, existing topic-based approaches rely on predefined sets of malicious topics, limiting their adaptability and their ability to uncover diverse adversarial prompts. In contrast, topic-free methods can automatically generate prompts on new topics, but most depend on Reinforcement Fine-Tuning (RFT) and suffer from limited coverage due to the exploration–exploitation dilemma, yielding adversarial prompts that remain topically narrow. To address these limitations, we propose $\textbf{RedTopic}$, a novel red teaming framework that generates topic-diverse adversarial prompts through a contextualized generation pipeline, an aggregate reward design, and a multi-objective RFT optimization loop. Experiments show that RedTopic produces more effective and diverse adversarial prompts than existing methods, with notable improvements on integrated evaluation metrics. We believe RedTopic represents a step toward more adaptive and topic-diverse red teaming for LLMs. $\textcolor{red}{\text{WARNING: This paper contains examples of potentially harmful text.}}$
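The abstract names an aggregate reward that balances attack effectiveness against topic diversity but does not define it. Below is a minimal, hypothetical Python sketch of one way such a reward could combine a safety judge's harmfulness score with an embedding-based topic-novelty term; every name here (`attack_success`, `topic_novelty`, `aggregate_reward`, the cosine-similarity heuristic, and the weight `alpha`) is an illustrative assumption, not the paper's actual design.

```python
# Hypothetical sketch of an aggregate reward for topic-diverse red teaming.
# Assumption: the reward is a convex combination of an attack-effectiveness
# term (from a safety judge) and a topic-novelty term (embedding distance to
# an archive of earlier prompts). This is NOT RedTopic's published design.

import math


def attack_success(judge_score: float) -> float:
    """Clamp a safety judge's harmfulness score to [0, 1]."""
    return max(0.0, min(1.0, judge_score))


def topic_novelty(prompt_embedding: list[float],
                  archive: list[list[float]]) -> float:
    """Reward topical distance from previously discovered prompts.

    Uses 1 minus the maximum cosine similarity to an archive of earlier
    prompt embeddings; an empty archive means everything is maximally novel.
    """
    if not archive:
        return 1.0

    def cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    return 1.0 - max(cosine(prompt_embedding, e) for e in archive)


def aggregate_reward(judge_score: float,
                     prompt_embedding: list[float],
                     archive: list[list[float]],
                     alpha: float = 0.5) -> float:
    """Convex combination of effectiveness and topical novelty."""
    return (alpha * attack_success(judge_score)
            + (1.0 - alpha) * topic_novelty(prompt_embedding, archive))


if __name__ == "__main__":
    # A prompt orthogonal to the archived topics scores high novelty.
    archive = [[1.0, 0.0], [0.7, 0.7]]
    print(aggregate_reward(0.8, [0.0, 1.0], archive, alpha=0.5))
```

In an RFT loop of the kind the abstract describes, a scalar like this would be fed back to the prompt-generator policy at each step; a multi-objective variant might instead optimize the two terms separately rather than collapsing them with a fixed `alpha`.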
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 19625