Comparison requires valid measurement: Rethinking attack success rate comparisons in AI red teaming

Published: 26 Sept 2025, Last Modified: 29 Oct 2025
Venue: NeurIPS 2025 Position Paper Track
License: CC BY 4.0
Keywords: red teaming, measurement, evaluation, validity, attack success rates, jailbreaking, threat model, measurement theory, conceptualization, LLM-as-judge
TL;DR: We argue that conclusions drawn about relative system safety or attack method efficacy via AI red teaming are often not supported by the evidence that attack success rate (ASR) comparisons provide.
Abstract: In this position paper we argue that conclusions drawn about relative system safety or attack method efficacy via AI red teaming are often not supported by the evidence that attack success rate (ASR) comparisons provide. We show, through conceptual, theoretical, and empirical contributions, that many conclusions are founded on apples-to-oranges comparisons or low-validity measurements. Our arguments are grounded in asking a simple question: When can attack success rates be meaningfully compared? To answer this question, we draw on ideas from social science measurement theory and inferential statistics, which, taken together, provide a conceptual grounding for understanding when numerical values obtained through the quantification of system attributes can be meaningfully compared. Through this lens, we articulate conditions under which ASRs can and cannot be meaningfully compared. Using jailbreaking as a running example, we provide examples and extensive discussion of apples-to-oranges ASR comparisons and measurement validity challenges.
Lay Summary: AI red teaming is a popular way to test whether generative AI systems behave safely. The aim of red teaming is to get systems to produce harmful or policy-violating responses. In more standardized or automated forms of red teaming, it is common to report the attack success rate (ASR) of red teaming activities: the percentage of attacks that succeeded in eliciting such responses from a given system. These ASRs are then compared across systems or red teaming methods to claim that a system with a lower ASR is “safer,” or that a method with a higher ASR is “more effective.” Our paper shows that drawing such conclusions from ASR comparisons can be misleading. Drawing on ideas from measurement theory and statistics, we explain why these comparisons are at times “apples to oranges.” Studies sometimes count success differently for one red teaming method than another, and the tools used to judge whether an attack succeeded do not always perform equally well across systems or methods, leading to differences in ASRs that do not reflect true differences in safety or attack efficacy. We offer a framework for determining when ASRs can be meaningfully compared and how to design red teaming activities that support valid comparison. By identifying pitfalls in current practice and providing concrete recommendations, our goal is to make AI safety evaluations more rigorous, interpretable, and trustworthy.
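Illustrative sketch (not from the paper): the snippet below shows how an ASR is typically computed as a proportion of judged successes and pairs each point estimate with a Wilson 95% confidence interval. The attack methods, counts, and interval choice are hypothetical; the point is that two ASRs whose intervals overlap may not reflect a real difference in attack efficacy or system safety, which is one of the statistical considerations the paper draws on.

```python
import math

def asr(successes: int, attempts: int) -> float:
    """Attack success rate: fraction of attacks judged successful."""
    return successes / attempts

def wilson_interval(successes: int, attempts: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 -> ~95%)."""
    p = successes / attempts
    denom = 1 + z**2 / attempts
    center = (p + z**2 / (2 * attempts)) / denom
    half = z * math.sqrt(p * (1 - p) / attempts + z**2 / (4 * attempts**2)) / denom
    return center - half, center + half

# Hypothetical judged outcomes for two attack methods against the same system.
results = {
    "method A": {"successes": 42, "attempts": 100},
    "method B": {"successes": 35, "attempts": 100},
}

for name, r in results.items():
    lo, hi = wilson_interval(r["successes"], r["attempts"])
    print(f"{name}: ASR={asr(r['successes'], r['attempts']):.2f} "
          f"(95% CI {lo:.2f}-{hi:.2f})")
```

Even this sketch assumes the two methods were scored by the same judge under the same success criterion; as the paper argues, when that assumption fails, no amount of statistical adjustment makes the comparison apples-to-apples.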
Submission Number: 582