ChainAttack: Black-Box Adversarial Attacks on Generative AI Services via Chain-of-Thought

Published: 2025 · Last Modified: 12 Nov 2025 · ICWS 2025 · CC BY-SA 4.0
Abstract: The advancement of large language models (LLMs) has made generative AI services an indispensable component of modern web applications, offering powerful capabilities through API-based interactions. However, the black-box nature of these services, where users lack access to internal mechanisms such as training or fine-tuning processes, poses significant challenges for adversarial attack research. Traditional attack methods typically require access to model parameters or gradients, which is impractical in real-world scenarios. In this paper, we introduce ChainAttack, a black-box adversarial attack method that exploits LLMs' in-context learning by crafting Chain-of-Thought (CoT) adversarial prompts. This method manipulates intermediate reasoning steps to elicit harmful outputs without requiring access to model internals. Our evaluation across six benchmarks demonstrates that ChainAttack achieves a 34.4% success rate, particularly excelling in logic-intensive tasks such as mathematical problem solving. This work exposes critical vulnerabilities in CoT prompting and underscores the need for future research on building more robust and trustworthy generative AI systems.
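To make the black-box setting described in the abstract concrete, the sketch below shows the general shape of a CoT-style adversarial query: the attacker controls only the prompt text sent to an API and embeds a manipulated step-by-step demonstration intended to steer the model's intermediate reasoning. This is an illustrative assumption, not the paper's actual implementation; `query_llm`, `build_cot_adversarial_prompt`, and the prompt wording are hypothetical.

```python
# Minimal sketch of a black-box Chain-of-Thought adversarial query.
# Only the prompt is controlled; no model parameters or gradients are used.
# `query_llm` is a hypothetical stand-in for any chat-completion API call.
from typing import Callable


def build_cot_adversarial_prompt(task: str, poisoned_steps: list[str]) -> str:
    """Compose an in-context demonstration whose reasoning chain is
    manipulated, followed by the attacker's actual query."""
    demo = "\n".join(f"Step {i + 1}: {s}" for i, s in enumerate(poisoned_steps))
    return (
        "Here is an example of how to solve this kind of problem.\n"
        f"{demo}\n"
        "Now solve the following task, reasoning step by step in the same way:\n"
        f"{task}"
    )


def chain_attack(task: str, poisoned_steps: list[str],
                 query_llm: Callable[[str], str]) -> str:
    """Send a single black-box query built from the manipulated demonstration."""
    prompt = build_cot_adversarial_prompt(task, poisoned_steps)
    return query_llm(prompt)
```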