On Large Language Models for Effective Red Teaming

ACL ARR 2026 January Submission 8554 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: red teaming, Large Language Models
Abstract: Red teaming attacks are a well-established approach for identifying weaknesses in large language models (LLMs). As the generative capabilities of LLMs continue to improve, researchers have increasingly leveraged them to automatically generate red teaming attacks, often by crafting adversarial prompts that target other LLMs. Despite this progress, there is currently no effective strategy for selecting suitable LLMs to serve as red teaming agents. In this work, we propose a systematic framework to investigate how various properties of an LLM, including its safety alignment, general capability, and parameter scale, influence its effectiveness in generating red teaming attacks. The goal of this study is to understand the mechanisms behind the effectiveness of red teaming LLMs and to provide principled guidance for selecting appropriate red teaming LLMs.
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: red teaming, adversarial attacks
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 8554