Working Memory Attack on LLMs

Published: 05 Mar 2025, Last Modified: 20 Apr 2025
Venue: NFAM 2025 Poster
License: CC BY 4.0
Track: long paper (up to 5 pages)
Keywords: Working Memory Attack, LLM Jailbreak, Safety Alignment, LLMs Robustness
TL;DR: Inspired by human working memory constraints, we craft an attack vector that exploits working memory limitations in LLMs, achieving up to a 99.9% attack success rate on state-of-the-art LLMs.
Abstract: In-context learning (ICL) has emerged as a powerful capability of large language models (LLMs), enabling task adaptation without parameter updates. However, this capability also introduces potential vulnerabilities that could compromise model safety and security. Drawing inspiration from neuroscience, particularly the concept of working memory limitations, we investigate how these constraints can be exploited in LLMs through ICL. We develop a novel multi-task methodology extending the neuroscience dual-task paradigm to systematically measure the impact of working memory overload. Our experiments demonstrate that progressively increasing task-irrelevant token generation before the \emph{observation task} degrades model performance, providing a quantifiable measure of working memory load. Building on these findings, we present a new attack vector that exploits working memory overload to bypass safety mechanisms in state-of-the-art LLMs, achieving high attack success rates across multiple models. We empirically validate this threat model and show that advanced models such as GPT-4, Claude-3.5 Sonnet, Claude-3 OPUS, Llama-3-70B-Instruct, Gemini-1.0-Pro, and Gemini-1.5-Pro can be successfully jailbroken, with attack success rates of up to 99.99%. Additionally, we demonstrate the transferability of these attacks, showing that higher-capability LLMs can be used to craft working memory overload attacks targeting other models. By expanding our experiments to encompass a broader range of models and by highlighting vulnerabilities in LLMs' ICL, we aim to ensure the development of safer and more reliable AI systems. We have publicly released our jailbreak code and artifacts at this [URL](https://github.com/UNHSAILLab/working-memory-attack-on-llms).
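A minimal sketch of the kind of prompt construction the abstract describes: progressively more task-irrelevant generation is requested before the final observation task, and performance on that task is compared across load levels. This is an illustrative assumption only; the function name `build_overload_prompt`, the distractor wording, and the load values are hypothetical and not taken from the paper or the released repository.

```python
# Illustrative sketch (NOT the paper's exact procedure): construct prompts that
# force increasing amounts of task-irrelevant token generation before the final
# "observation task", approximating a working-memory-overload measurement setup.

def build_overload_prompt(observation_task: str, num_distractors: int) -> str:
    """Prepend `num_distractors` irrelevant generation tasks before the target task."""
    distractors = [
        f"Task {i + 1}: List {10 + 5 * i} unrelated common nouns, one per line."
        for i in range(num_distractors)
    ]
    parts = distractors + [
        f"Final task (answer only after completing all tasks above): {observation_task}"
    ]
    return "\n\n".join(parts)


if __name__ == "__main__":
    # Benign example: vary the memory load on a simple observation task and
    # inspect how the prompt grows; model responses would then be scored
    # per load level to quantify the degradation described in the abstract.
    for load in (0, 2, 4, 8):
        prompt = build_overload_prompt("What is the capital of France?", load)
        print(f"--- load={load}, prompt length={len(prompt)} chars ---")
```

The actual attack prompts and evaluation artifacts are in the authors' released repository linked above.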
Submission Number: 23