CRDA: Content Risk Drift Assessment of Large Language Models through Adversarial Multi-Agent Interaction

Published: 01 Jan 2024, Last Modified: 24 May 2025 · IJCNN 2024 · CC BY-SA 4.0
Abstract: As Large Language Models (LLMs) extend their capabilities into multi-agent collaborative applications, the unpredictability of generative content risks has intensified. In particular, in ongoing interaction scenarios with users, it remains unclear whether generative content risk drifts over time. In this context, "drift risk" refers to the trend of progressively intensifying content risk that emerges during sustained adversarial interactions among LLM agents. Moreover, the high cost of constructing complex adversarial environments for agents limits the transferability of existing LLM assessment methods to multi-agent adversarial scenarios. In this paper, we introduce CRDA, a low-cost and lightweight framework for assessing the content risk drift of LLMs. Bypassing the need to construct complex adversarial environments, the framework integrates role assignments and a responses memory to automatically guide multi-round adversarial interactions among LLM agents, i.e., multiple agents instantiated as avatars of a single LLM. These agents' interactions then enable analysis of the content risk drift of that LLM. We further explore how restricted roles and unsafe content with negative viewpoints in the responses memory affect the content risk drift of LLMs. Given the rapid advancement of Chinese LLM capabilities, this study selects real adversarial topics in Chinese and assesses the content risk drift of five representative Chinese LLMs. We find that these LLMs exhibit significant content risk drift even after safety alignment, showing an initial increase followed by a gradual decrease. As the adversarial process progresses, agents under restricted roles more effectively breach the model's safety alignment, leading to content risk drift of the LLM. Content risk drift can be quantified by measuring the rate at which LLM agents deteriorate from positive to negative stances and by analyzing the underlying trends across the automated multi-round adversarial interactions. Under restricted-role and general-role adversarial interactions, the agents of the five Chinese LLMs exhibit overall average increases of 31.5% and 16.38%, respectively, in the cumulative deterioration rate by the 10th round, compared to the baseline no-role adversarial interactions. Finally, we hope that the framework and findings presented in this paper will offer valuable insights for research on safety alignment of LLM agents during adversarial processes.
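As an illustration of the quantitative assessment described in the abstract, the following is a minimal sketch, not the authors' released code, of how a cumulative deterioration rate across adversarial rounds might be computed. The function name, stance labels, and the assumption that an external classifier provides per-round stance labels are all hypothetical.

```python
# Hypothetical sketch: cumulative deterioration rate across adversarial rounds.
# "Deterioration" is assumed to mean an agent's stance flipping from positive to
# negative; stance labels per agent per round are assumed to come from an
# external stance/safety classifier (not shown here).

from typing import List

def cumulative_deterioration_rate(stances: List[List[str]]) -> List[float]:
    """
    stances[r][a] is the stance label ("positive" or "negative") of agent `a`
    after adversarial round `r`, where round 0 is the initial stance before
    any interaction. Returns, for each round r >= 1, the fraction of initially
    positive agents that have turned negative by that round.
    """
    initial = stances[0]
    positive_agents = [i for i, s in enumerate(initial) if s == "positive"]
    if not positive_agents:
        return [0.0] * (len(stances) - 1)

    rates = []
    deteriorated = set()
    for round_stances in stances[1:]:
        for i in positive_agents:
            if round_stances[i] == "negative":
                deteriorated.add(i)
        rates.append(len(deteriorated) / len(positive_agents))
    return rates


# Example: 3 agents over 2 rounds; one agent flips negative in round 1 and
# another in round 2, giving cumulative rates [1/3, 2/3].
print(cumulative_deterioration_rate([
    ["positive", "positive", "positive"],
    ["negative", "positive", "positive"],
    ["negative", "negative", "positive"],
]))
```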