Fast Proxies for LLM Robustness Evaluation

Published: 05 Mar 2025, Last Modified: 07 Apr 2025 · BuildingTrust · CC BY 4.0
Track: Tiny Paper Track (between 2 and 4 pages)
Keywords: LLM Robustness, Red-Teaming
TL;DR: We show that fast proxy attacks can be leveraged to predict a model's robustness against real-world attacks at three orders of magnitude lower cost.
Abstract: Evaluating the robustness of LLMs to adversarial attacks is crucial for safe deployment, yet current red-teaming methods are often prohibitively expensive. We compare the ability of fast proxy metrics to predict the real-world robustness of an LLM against a simulated attacker ensemble. This allows us to estimate a model's robustness to computationally expensive attacks without requiring runs of the attacks themselves. Specifically, we consider gradient-descent-based embedding-space attacks, prefilling attacks, and direct attacks. Even though direct attacks in particular do not achieve a high attack success rate (ASR), we find that they and embedding-space attacks predict ensemble ASR well, achieving $r_p=0.86$ (linear) and $r_s=0.97$ (Spearman rank) correlations with the full attack ensemble while reducing computational cost by three orders of magnitude.
Submission Number: 83
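As a minimal illustration of the correlation analysis described in the abstract, the sketch below computes the Pearson ($r_p$) and Spearman rank ($r_s$) correlations between a cheap proxy metric and full-ensemble attack success rates across a set of models. The per-model values are placeholders, not results from the paper, and the variable names are assumptions for illustration only.

```python
# Sketch (not the paper's code): correlate a fast proxy attack's ASR with the
# ASR measured by the expensive full attack ensemble, across several models.
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-model attack success rates (placeholders, one entry per model):
# proxy_asr from a cheap attack (e.g., embedding-space or direct attack),
# ensemble_asr from the full, computationally expensive attacker ensemble.
proxy_asr = [0.05, 0.12, 0.30, 0.45, 0.62]
ensemble_asr = [0.10, 0.18, 0.35, 0.55, 0.70]

r_p, _ = pearsonr(proxy_asr, ensemble_asr)   # linear correlation r_p
r_s, _ = spearmanr(proxy_asr, ensemble_asr)  # Spearman rank correlation r_s
print(f"r_p = {r_p:.2f}, r_s = {r_s:.2f}")
```

A high rank correlation in such an analysis indicates that the cheap proxy orders models by robustness in the same way the full ensemble would, which is what makes it useful as a low-cost screening tool.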