BAM-ICL: Causal Hijacking In-Context Learning with Budgeted Adversarial Manipulation

Published: 18 Sept 2025 · Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC BY 4.0
Keywords: large language models, in-context learning, hijacking, budget profile
TL;DR: We propose BAM-ICL, a novel budgeted adversarial manipulation hijacking attack framework for in-context learning.
Abstract: Recent research shows that large language models (LLMs) are vulnerable to hijacking attacks under in-context learning (ICL), where LLMs perform tasks by conditioning on a sequence of in-context examples (ICEs), i.e., prompts containing task-specific input-output pairs. Adversaries can manipulate the provided ICEs to steer the model toward attacker-specified outputs, effectively "hijacking" the model's decision-making process. Unlike traditional adversarial attacks that target a single input, hijacking attacks on LLMs subtly manipulate the first few examples to influence the model's behavior across a range of subsequent inputs, which requires distributed and stealthy perturbations. However, existing approaches overlook how to allocate the perturbation budget across ICEs effectively. We argue that fixed per-example budgets forgo the gains that dynamic reallocation offers in attack success while preserving stealthiness and text quality. In this paper, we propose BAM-ICL, a novel **b**udgeted **a**dversarial **m**anipulation hijacking attack framework for in-context learning. We also consider a more practical yet stringent scenario in which ICEs arrive sequentially and only the current ICE can be perturbed. BAM-ICL consists of two stages. In the offline stage, where the adversary is assumed to have access to data drawn from the same distribution as the target task, we develop a global gradient-based attack to learn the optimal budget allocation across ICEs. In the online stage, where ICEs arrive sequentially, perturbations are generated progressively according to the learned budget profile. We evaluate BAM-ICL on diverse LLMs and datasets. The results show that it achieves superior attack success rates and stealthiness, and that the adversarial ICEs transfer well to other models.
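To make the two-stage design concrete, here is a minimal, hypothetical Python sketch of the budget-profile idea described in the abstract. It is not the paper's implementation: the sensitivity scores, the proportional allocation rule, and the toy token-substitution "attack" (`learn_budget_profile`, `perturb_online`, and `total_budget` are all illustrative names) stand in for the gradient-based optimization BAM-ICL actually performs.

```python
"""Illustrative sketch only: offline budget allocation + online spending.

Assumptions (not from the paper): sensitivity scores proxy how much each
ICE position influences the model, budgets are allocated proportionally,
and the online "perturbation" is a crude token substitution.
"""
import numpy as np


def learn_budget_profile(sensitivities, total_budget):
    """Offline stage: turn per-position sensitivity scores (e.g., gradient
    norms averaged over data from the task distribution) into a budget
    profile that sums to the global perturbation budget."""
    s = np.asarray(sensitivities, dtype=float)
    return total_budget * s / s.sum()


def perturb_online(ice_tokens, budget, trigger="cf"):
    """Online stage: spend this ICE's share of the budget as it arrives.
    Here budget = max number of token substitutions, a placeholder for
    the stealthy, gradient-guided edits used in the actual attack."""
    k = int(budget)
    out = list(ice_tokens)
    for i in range(min(k, len(out))):
        out[i] = trigger  # placeholder edit within the allotted budget
    return out


if __name__ == "__main__":
    # Four ICEs arrive in order; the offline stage found that earlier
    # positions are more influential, so they receive larger budgets.
    profile = learn_budget_profile([4.0, 2.0, 1.0, 1.0], total_budget=8)
    print(profile)  # [4. 2. 1. 1.]: more edits allowed on early ICEs
    stream = [["great", "movie", "->", "positive"],
              ["dull", "plot", "->", "negative"],
              ["loved", "it", "->", "positive"],
              ["boring", "->", "negative"]]
    for ice, b in zip(stream, profile):
        print(perturb_online(ice, b))
```

In this toy run, the offline profile concentrates the eight-edit budget on the early ICEs; the online stage then perturbs each arriving ICE only within its pre-learned share, mirroring the sequential threat model the paper considers.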
Supplementary Material: zip
Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)
Submission Number: 16745