On Large Language Models for Effective Red Teaming

ACL ARR 2026 January Submission 8554 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: red teaming, Large Language Models
Abstract: Red teaming attacks are a well-established approach for identifying weaknesses in large language models (LLMs). As the generative capabilities of LLMs continue to improve, researchers have increasingly leveraged them to automatically generate red teaming attacks, often by crafting adversarial prompts that target other LLMs. Despite this progress, there is currently no effective strategy for selecting suitable LLMs to serve as red teaming agents. In this work, we propose a systematic framework to investigate how various properties of an LLM, including its safety alignment, general capability, and parameter scale, influence its effectiveness in generating red teaming attacks. The goal of this study is to understand the mechanisms behind the effectiveness of red teaming LLMs and to provide principled guidance for selecting appropriate red teaming LLMs.
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: red teaming, adversarial attacks
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 8554