Keywords: certified robustness; knapsack problems; large language models.
TL;DR: We upper bound the worst-case robustness with strong adaptive attacks and lower bound it with fractional/0-1 knapsack solvers.
Abstract: Recent studies have revealed the vulnerability of large language models to adversarial attacks, where adversaries craft specific inputs to induce wrong or even harmful outputs. Although various empirical defenses have been proposed, their worst-case robustness remains unexplored, raising concerns about their vulnerability to stronger future adversaries. In this paper, we systematically study the worst-case robustness of LLMs from both empirical and theoretical perspectives. First, we upper bound the worst-case robustness of deterministic defenses using enhanced white-box attacks, showing that most of them achieve nearly 0\% robustness against white-box adversaries. Then, we derive a general tight lower bound for randomized smoothing using fractional or 0-1 knapsack solvers, and apply it to derive theoretical lower bounds on the worst-case robustness of previous stochastic defenses. For example, we certify the robustness of GPT-4o with uniform kernel smoothing against \textit{any possible attack}, with an average \(\ell_0\) perturbation of 2.02 or an average suffix length of 6.41 on the AdvBench dataset.
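To illustrate the knapsack connection, here is a minimal sketch (not the authors' code) of how a Neyman-Pearson-style lower bound for randomized smoothing can be computed with a greedy fractional-knapsack solver. It assumes the smoothing noise induces finitely many likelihood-ratio regions; the inputs `p_clean` and `p_adv` (hypothetical names) hold each region's probability under the clean and adversarially perturbed smoothing distributions, and `p_a` is the observed probability that the base model produces the desired output.

```python
def certified_lower_bound(p_clean, p_adv, p_a):
    """Sketch: minimize P_adv(S) subject to P_clean(S) >= p_a via fractional knapsack.

    The returned value lower-bounds the probability that the smoothed model
    still produces the desired output under the worst-case perturbation
    (binary case: the certificate holds if the bound exceeds 0.5).
    """
    # Sort regions by likelihood ratio P_adv / P_clean (ascending), so the
    # clean-probability "budget" p_a is spent first on regions that cost the
    # least adversarial probability mass.
    regions = sorted(
        zip(p_clean, p_adv),
        key=lambda r: r[1] / r[0] if r[0] > 0 else float("inf"),
    )
    budget = p_a   # clean probability mass still to be covered
    bound = 0.0    # accumulated adversarial probability mass
    for pc, pa in regions:
        if budget <= 0:
            break
        take = min(pc, budget)                     # fractional take is allowed
        bound += pa * (take / pc) if pc > 0 else 0.0
        budget -= take
    return bound
```

The greedy pass over ratio-sorted regions is exactly the optimal strategy for the fractional knapsack relaxation; the 0-1 variant mentioned in the abstract would instead require an exact (e.g., dynamic-programming) solver when regions cannot be split.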
Primary Area: learning theory
Submission Number: 5044