Keywords: game theory, safety, large language models
Abstract: As large language models (LLMs) become increasingly powerful, concerns about their safe use have grown accordingly. Despite the deployment of various alignment mechanisms to prevent malicious use, these mechanisms can still be circumvented through carefully crafted adversarial prompts. Here, we introduce an adversarial prompting attack strategy for LLM-based systems: attack by hiding intent, a generalization of many practical attacks, in which a malicious intent is concealed by composing the application of several skills. We propose a game-theoretic framework to characterize its interaction with a defense system that implements both prompt and response filtering. We derive the equilibrium of the game and highlight structural advantages for the attacker. We then design and theoretically analyze a defense mechanism specifically aimed at mitigating the proposed attack. Finally, we empirically demonstrate the effectiveness of the proposed attack against several real-world LLMs across diverse malicious behaviors, comparing it with existing adversarial prompting methods.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 23518
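The paper's formal model is not reproduced in this listing. As a purely hypothetical sketch of the kind of attacker-defender interaction the abstract describes, the snippet below sets up a small zero-sum game between an attacker (who may hide intent by composing skills or send a direct malicious prompt) and a defender (who filters prompts or responses) and computes a mixed-strategy equilibrium by linear programming. The strategy names and payoff numbers are illustrative assumptions, not results or definitions from the paper.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical attacker payoffs (probability the malicious intent slips through).
# Rows: attacker strategies -- [compose skills to hide intent, direct malicious prompt]
# Cols: defender strategies -- [filter prompts, filter responses]
# These numbers are illustrative only; they are not taken from the paper.
A = np.array([
    [0.7, 0.2],
    [0.1, 0.6],
])
n_rows, n_cols = A.shape

# Zero-sum game: the attacker maximizes the value v such that (A^T p)_j >= v
# for every defender column j. Encode as an LP over x = [p_1, ..., p_n, v].
c = np.zeros(n_rows + 1)
c[-1] = -1.0                                      # linprog minimizes, so minimize -v
A_ub = np.hstack([-A.T, np.ones((n_cols, 1))])    # v - (A^T p)_j <= 0 for each column j
b_ub = np.zeros(n_cols)
A_eq = np.hstack([np.ones((1, n_rows)), np.zeros((1, 1))])  # probabilities sum to 1
b_eq = np.array([1.0])
bounds = [(0, None)] * n_rows + [(None, None)]    # p >= 0, v unrestricted

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
p_attacker, game_value = res.x[:n_rows], res.x[-1]
print("attacker mixed strategy:", p_attacker)
print("attack success rate at equilibrium:", game_value)
```

With these made-up payoffs the attacker mixes equally between the two strategies and succeeds about 40% of the time at equilibrium; the actual equilibrium structure and the attacker advantage claimed in the abstract depend on the paper's own formalization.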