Strengthening Robustness to Adversarial Prompts: The Role of Multi-Agent Conversations in Large Language Models

ICLR 2025 Workshop BuildingTrust Submission140 Authors

11 Feb 2025 (modified: 06 Mar 2025) · Submitted to BuildingTrust · CC BY 4.0
Track: Long Paper Track (up to 9 pages)
Keywords: large language models, adversarial prompts, extended debates, multi-agent framework, toxicity reduction, cross-provider, defense mechanisms, model resilience
TL;DR: Extended debates among diverse, cross-provider agents significantly improve large language models’ resilience against adversarial prompts by reducing toxicity and attack success compared to conventional defenses.
Abstract: Large language models have shown impressive capabilities in problem-solving, understanding, and reasoning \citep{Touvron2023, Du2023}, yet they remain susceptible to sophisticated adversarial prompts that can manipulate them into generating harmful outputs \citep{Zou2023, Wei2023}. Current defense mechanisms, such as self-refinement and safety guardrails \citep{Korbak2023, Robey2023}, have shown limited effectiveness against these attacks. Building upon the multi-agent debate framework \citep{Chern2024}, our research demonstrates how extended debates among diverse debaters enhance model resilience \citep{Chan2023}. Using multiple attack techniques, we assess toxicity and attack success rates across varying numbers of debaters and debate lengths \citep{Ganguli2022, Perez2022}. Our results show that cross-provider debates with extended interaction periods achieve significantly lower toxicity scores than single-provider systems. These findings advance our understanding of collaborative defense mechanisms in language models \citep{Cohen2023}.
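To make the cross-provider debate idea concrete, the following is a minimal sketch of an extended debate loop over a potentially adversarial prompt. The helpers `query_agent` and `toxicity_score` are hypothetical placeholders (not from the paper); the authors' actual protocol, prompts, and scoring pipeline may differ.

```python
from typing import List

def query_agent(provider: str, prompt: str) -> str:
    """Placeholder for a call to one provider's chat model (hypothetical)."""
    raise NotImplementedError("Wire this to the provider's chat API.")

def toxicity_score(text: str) -> float:
    """Placeholder for a toxicity classifier returning a score in [0, 1]."""
    raise NotImplementedError("Wire this to a toxicity scoring service.")

def debate(prompt: str, providers: List[str], rounds: int = 3) -> str:
    """Run an extended debate: each agent sees the prompt plus the other
    agents' current answers and revises its own; the least toxic final
    answer is returned."""
    answers = {p: query_agent(p, prompt) for p in providers}
    for _ in range(rounds):
        for p in providers:
            others = "\n".join(
                f"[{q}] {a}" for q, a in answers.items() if q != p
            )
            revision_prompt = (
                f"Original prompt:\n{prompt}\n\n"
                f"Other agents' answers:\n{others}\n\n"
                "Revise your answer, refusing any harmful request."
            )
            answers[p] = query_agent(p, revision_prompt)
    # Select the final answer with the lowest toxicity score.
    return min(answers.values(), key=toxicity_score)
```

Longer debates correspond to a larger `rounds` value, and cross-provider settings correspond to `providers` drawn from different model families rather than one.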
Submission Number: 140