Abstract: Red teaming attacks are a proven method for identifying weaknesses in large language models (LLMs). As the generation capabilities of LLMs improve, researchers have successfully used them to automatically generate red teaming attacks, typically by crafting adversarial prompts that target other LLMs. However, there is currently no effective strategy for choosing which LLM is best suited to generating such attacks. In this work, we establish a framework to investigate how various properties of an LLM, including its security, general capabilities, and number of parameters, affect its ability to generate red teaming attacks. The goal of this study is to understand the mechanisms behind the effectiveness of red teaming LLMs and to provide a basis for selecting an appropriate red teaming LLM.
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: Red Teaming
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 1550