RobustKV: Defending Large Language Models against Jailbreak Attacks via KV Eviction

Published: 22 Jan 2025, Last Modified: 18 Feb 2025ICLR 2025 PosterEveryoneRevisionsBibTeXCC BY 4.0
Keywords: Jailbreak Attack, Large Language Model, KV cache optimization
TL;DR: We have designed a special KV cache compression policy that can help LLMs defend against Jailbreak Attacks.
Abstract: Jailbreak attacks circumvent LLMs' built-in safeguards by concealing harmful queries within adversarial prompts. While most existing defenses attempt to mitigate the effects of adversarial prompts, they often prove inadequate as adversarial prompts can take arbitrary, adaptive forms. This paper introduces RobustKV, a novel jailbreak defense that takes a fundamentally different approach by selectively removing critical tokens of harmful queries from key-value (KV) caches. Intuitively, for an adversarial prompt to be effective, its tokens must achieve sufficient `importance' (measured by attention scores), which consequently lowers the importance of tokens in the concealed harmful query. Therefore, by carefully evicting the KVs of low-ranked tokens, RobustKV minimizes the harmful query's presence in the KV cache, thus preventing the LLM from generating informative responses. Extensive evaluation using benchmark datasets and models demonstrates that RobustKV effectively counters state-of-the-art jailbreak attacks while maintaining the LLM's performance on benign queries. Notably, RobustKV creates an interesting effectiveness-evasiveness dilemma for the adversary, leading to its robustness against adaptive attacks.{(Warning: This paper contains potentially harmful content generated by LLMs.)}
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 11079
Loading