Second-order Jailbreaks: Generative Agents Successfully Manipulate Through an Intermediary

Published: 31 Oct 2023, Last Modified: 29 Nov 2023MASEC@NeurIPS'23 PosterEveryoneRevisionsBibTeX
Keywords: large language models, multi-agent, negotiations, security
Abstract: As the capabilities of Large Language Models (LLMs) continue to expand, their application in communication tasks is becoming increasingly prevalent. However, this widespread use brings with it novel risks, including the susceptibility of LLMs to "jailbreaking" techniques. In this paper, we explore the potential for such risks in two- and three-agent communication networks, where one agent is tasked with protecting a password while another attempts to uncover it. Our findings reveal that an attacker, powered by advanced LLMs, can extract the password even through an intermediary that is instructed to prevent this. Our contributions include an experimental setup for evaluating the persuasiveness of LLMs, a demonstration of LLMs' ability to manipulate each other into revealing protected information, and a comprehensive analysis of this manipulative behavior. Our results underscore the need for further investigation into the safety and security of LLMs in communication networks.
Submission Number: 21
Loading