Keywords: game theory, safety, large language models
Abstract: As large language models (LLMs) become increasingly powerful, concerns about their safe use have grown accordingly. Despite the deployment of various alignment mechanisms to prevent malicious use, these mechanisms can still be circumvented through carefully crafted adversarial prompts. Here, we introduce an adversarial prompting attack strategy for LLM-based systems: attack by hiding intent, a generalization of many practical attacks, in which a malicious intent is concealed by composing the application of several skills. We propose a game-theoretic framework to characterize its interaction with a defense system that implements both prompt and response filtering. We derive the equilibrium of the game and highlight structural advantages for the attacker. We then design and theoretically analyze a defense mechanism specifically aimed at mitigating the proposed attack. Finally, we empirically demonstrate the effectiveness of the proposed attack against several real-world LLMs across diverse malicious behaviors, comparing it with existing adversarial prompting methods.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 23518
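The paper's formal model is not reproduced in this listing. As a purely hypothetical sketch of the kind of attacker-defender interaction the abstract describes, the snippet below sets up a small zero-sum game between an attacker (who may hide intent by composing skills or send a direct malicious prompt) and a defender (who filters prompts or responses) and computes a mixed-strategy equilibrium by linear programming. The strategy names and payoff numbers are illustrative assumptions, not results or definitions from the paper.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical attacker payoffs (probability the malicious intent slips through).
# Rows: attacker strategies -- [compose skills to hide intent, direct malicious prompt]
# Cols: defender strategies -- [filter prompts, filter responses]
# These numbers are illustrative only; they are not taken from the paper.
A = np.array([
    [0.7, 0.2],
    [0.1, 0.6],
])
n_rows, n_cols = A.shape

# Zero-sum game: the attacker maximizes the value v such that (A^T p)_j >= v
# for every defender column j. Encode as an LP over x = [p_1, ..., p_n, v].
c = np.zeros(n_rows + 1)
c[-1] = -1.0                                      # linprog minimizes, so minimize -v
A_ub = np.hstack([-A.T, np.ones((n_cols, 1))])    # v - (A^T p)_j <= 0 for each column j
b_ub = np.zeros(n_cols)
A_eq = np.hstack([np.ones((1, n_rows)), np.zeros((1, 1))])  # probabilities sum to 1
b_eq = np.array([1.0])
bounds = [(0, None)] * n_rows + [(None, None)]    # p >= 0, v unrestricted

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
p_attacker, game_value = res.x[:n_rows], res.x[-1]
print("attacker mixed strategy:", p_attacker)
print("attack success rate at equilibrium:", game_value)
```

With these made-up payoffs the attacker mixes equally between the two strategies and succeeds about 40% of the time at equilibrium; the actual equilibrium structure and the attacker advantage claimed in the abstract depend on the paper's own formalization.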