Gandalf the Red: Adaptive Security for LLMs

Published: 01 May 2025, Last Modified: 18 Jun 2025. ICML 2025 poster. License: CC BY 4.0
TL;DR: We introduce D-SEC, a novel threat model for LLM applications that accounts for adaptive attacks and the security-utility trade-off, and evaluate it using Gandalf, a large-scale crowd-sourced gamified red-teaming platform.
Abstract: Current evaluations of defenses against prompt attacks in large language model (LLM) applications often overlook two critical factors: the dynamic nature of adversarial behavior and the usability penalties imposed on legitimate users by restrictive defenses. We propose D-SEC (Dynamic Security Utility Threat Model), which explicitly separates attackers from legitimate users, models multi-step interactions, and expresses the security-utility trade-off in an optimizable form. We further address shortcomings in existing evaluations by introducing Gandalf, a crowd-sourced, gamified red-teaming platform designed to generate realistic, adaptive attacks. Using Gandalf, we collect and release a dataset of 279k prompt attacks. Complemented by benign user data, our analysis reveals the interplay between security and utility, showing that defenses integrated into the LLM (e.g., system prompts) can degrade usability even without blocking requests. We demonstrate that restricted application domains, defense-in-depth, and adaptive defenses are effective strategies for building secure and useful LLM applications.
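The abstract describes expressing the security-utility trade-off in an optimizable form. As an illustration only, a minimal sketch of such a combined objective might look like the following; the function name, weighting scheme, and metrics are hypothetical and are not the paper's actual formulation.

```python
# Hypothetical sketch of a combined security-utility objective in the
# spirit of D-SEC; the paper's actual formulation may differ.

def security_utility_score(attack_success_rate: float,
                           benign_block_rate: float,
                           weight: float = 0.5) -> float:
    """Higher is better: penalize both successful attacks (security loss)
    and blocked legitimate requests (utility loss)."""
    security = 1.0 - attack_success_rate
    utility = 1.0 - benign_block_rate
    return weight * security + (1.0 - weight) * utility

# A defense that blocks everything is perfectly secure but useless,
# so it scores no better than the midpoint under equal weighting:
print(security_utility_score(attack_success_rate=0.0,
                             benign_block_rate=1.0))  # 0.5
```

This toy form makes explicit why "block everything" is not an acceptable defense, which is the trade-off the lay summary also highlights.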
Lay Summary: A major security flaw in today’s AI systems is their inability to tell the difference between data and instructions. This opens the door for attackers to take control of systems in unexpected ways. One of the biggest open challenges is figuring out how to evaluate an AI system’s security, and how that security affects legitimate users. After all, a system that blocks everything might be secure, but it’s not useful. We make two core contributions. First, we propose D-SEC, a new evaluation framework that balances how well a system defends against attackers with how it affects normal users. It also accounts for the fact that attackers adapt and learn over time. Second, we introduce Gandalf, a crowdsourced game where players try to trick AI systems. This lets us study creative, real-world attacks and test defenses using D-SEC. We found that many application defenses unintentionally hurt the user experience. Based on our research, we identify three strategies for building secure and usable AI: limit what the system can do, layer defenses, and use adaptive protections. To support the community, we’ve released a dataset of nearly 300,000 human-generated attacks to drive future work on secure, reliable AI.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/lakeraai/dsec-gandalf
Primary Area: Social Aspects->Security
Keywords: Large language models, Security, Safety, Threat model, Crowd-sourcing
Flagged For Ethics Review: true
Submission Number: 7492