Abstract: Contextual priming, where earlier stimuli covertly bias later judgments, offers an unexplored attack surface for large language models (LLMs). We uncover a contextual priming vulnerability in which a previous response in the dialogue can steer the model's subsequent behavior toward policy-violating content. Building on this insight, we propose \textbf{Response Attack}, which uses an auxiliary LLM to generate an intermediate harmful response to a paraphrased version of the original malicious query. This intermediate response is then inserted into the dialogue history and followed by a succinct follow-up prompt, thereby priming the target model to generate harmful content. Extensive experiments on both proprietary and open-source LLMs show that Response Attack achieves higher attack success rates and efficiency than state-of-the-art baselines. To mitigate this threat, we construct and release a context-aware safety fine-tuning dataset, which significantly reduces the attack success rate while preserving model capabilities.
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: safety and alignment, red teaming
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 5845