Keywords: LLM Safety, Benchmark, Agent
Abstract: As large language models (LLMs) become increasingly integrated into real-world applications, their vulnerability to prompt-based attacks has emerged as a critical safety concern. While prior research has uncovered various threats, including jailbreaks, prompt injections, and attacks on external sources or agentic systems, most evaluations are limited in scope, assessing attacks in isolation or at a small scale. This paper poses a fundamental question: \textit{Are frontier LLMs truly robust against the full spectrum of prompt attacks when evaluated systematically and at scale?}
To explore this, we propose \textbf{Agentic Prompt Attack (\PA)}, a novel three-agent framework that automates and unifies the reproduction of prior prompt attack studies. \PA consists of (i) a Paper Agent that extracts attack specifications from research papers, (ii) a Repo Agent that retrieves implementation details from GitHub repositories, and (iii) a Code Agent that iteratively operationalizes each attack, regardless of its complexity, into executable prompts targeting LLMs. The agents collaborate to resolve ambiguities and reduce contextual noise throughout the process.
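A minimal sketch of how such a three-agent pipeline might be wired together is shown below; the `AttackSpec` fields, agent function names, and the `call_llm` placeholder are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch of a Paper Agent -> Repo Agent -> Code Agent pipeline.
# All names and interfaces here are assumptions for illustration only.
from dataclasses import dataclass, field


def call_llm(system_prompt: str, user_prompt: str) -> str:
    """Placeholder for an LLM backend call; swap in a real client."""
    return f"[LLM response to: {user_prompt[:60]}...]"


@dataclass
class AttackSpec:
    paper_title: str
    attack_description: str = ""       # filled by the Paper Agent
    implementation_notes: str = ""     # filled by the Repo Agent
    executable_prompts: list[str] = field(default_factory=list)  # Code Agent output


def paper_agent(paper_text: str, spec: AttackSpec) -> AttackSpec:
    # Extract the attack specification (threat model, templates, constraints) from the paper.
    spec.attack_description = call_llm(
        "Summarize the prompt-attack method described in this paper.", paper_text)
    return spec


def repo_agent(repo_readme: str, spec: AttackSpec) -> AttackSpec:
    # Pull concrete implementation details (prompt templates, parameters) from the repository.
    spec.implementation_notes = call_llm(
        "Extract prompt templates and attack parameters from this repository.", repo_readme)
    return spec


def code_agent(spec: AttackSpec, max_rounds: int = 3) -> AttackSpec:
    # Iteratively turn the spec into executable attack prompts, refining when output is ambiguous.
    for _ in range(max_rounds):
        candidate = call_llm(
            "Produce an executable attack prompt from this specification.",
            spec.attack_description + "\n" + spec.implementation_notes)
        spec.executable_prompts.append(candidate)
        if "AMBIGUOUS" not in candidate:  # crude stand-in for the agents' ambiguity resolution
            break
    return spec


if __name__ == "__main__":
    spec = AttackSpec(paper_title="Example jailbreak paper")
    spec = paper_agent("...paper text...", spec)
    spec = repo_agent("...repository README...", spec)
    spec = code_agent(spec)
    print(spec.executable_prompts)
```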
Using \textbf{\PA}, we analyzed over \textbf{104} prompt attack papers to build a large-scale, standardized attack library. This library enables systematic stress-testing of frontier LLMs and reveals that even the most recent models remain vulnerable to a wide range of known threats, highlighting persistent gaps in current safety alignment.
Our work introduces a new paradigm for evaluating LLM safety at scale, offering both a comprehensive benchmark for researchers and actionable guidance for developing more robust foundation models.
**WARNING: This paper contains examples of potentially harmful content.**
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 9071