Abstract: Red teaming attacks are a proven method for identifying weaknesses in large language models (LLMs). As the generation capabilities of LLMs improve, researchers have successfully used them to automatically generate red teaming attacks, typically by crafting adversarial prompts that target other LLMs. However, there is currently no effective strategy for choosing which LLM is best suited to generating such attacks. In this work, we establish a framework to investigate how various properties of an LLM, including its security, general capabilities, and number of parameters, affect its ability to generate red teaming attacks. The goal of this study is to understand the mechanisms behind the effectiveness of red teaming LLMs and to provide a basis for selecting an appropriate red teaming LLM.
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: Red Teaming
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 1550