Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation
Abstract: Intent detection, a core component of natural language understanding, has evolved into a crucial mechanism for safeguarding large language models (LLMs). While prior work has applied intent detection to strengthen LLMs’ moderation guardrails, showing significant success against content-level jailbreaks, the robustness of intent-aware guardrails under malicious manipulation remains under-explored. In this work, we investigate the vulnerability of intent-aware guardrails and show that LLMs exhibit implicit intent detection capabilities. We propose a two-stage intent-based prompt-refinement framework, IntentPrompt, which first transforms harmful inquiries into structured outlines and then reframes them into declarative-style narratives, iteratively optimizing prompts via feedback loops to improve jailbreak success for red-teaming purposes. Extensive experiments across four public benchmarks and various black-box LLMs show that our framework consistently outperforms several cutting-edge jailbreak methods and evades even advanced Intent Analysis (IA) and Chain-of-Thought (CoT)-based defenses. Specifically, our “FSTR+SPIN” variant achieves attack success rates ranging from 88.25% to 96.54% against CoT-based defenses on the o1 model, and from 86.75% to 97.12% on the GPT-4o model under IA-based defenses. These findings highlight a critical weakness in LLMs’ safety mechanisms and suggest that intent manipulation poses a growing challenge to content moderation guardrails.