Abstract: Contextual priming, where earlier stimuli covertly bias later judgments, offers an unexplored attack surface for large language models (LLMs). We uncover a contextual priming vulnerability in which a previous response in the dialogue can steer the model's subsequent behavior toward policy-violating content. Building on this insight, we propose \textbf{Response Attack}, which uses an auxiliary LLM to generate an intermediate harmful response to a paraphrased version of the original malicious query. This intermediate response is then inserted into the dialogue history and followed by a succinct follow-up prompt, thereby priming the target model to generate harmful content. Extensive experiments on both proprietary and open-source LLMs show that Response Attack achieves higher attack success rates and efficiency than state-of-the-art baselines. To mitigate this threat, we construct and release a context-aware safety fine-tuning dataset, which significantly reduces the attack success rate while preserving model capabilities.
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: safety and alignment, red teaming
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 5845