Understanding Model Ensemble in Transferable Adversarial Attack

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We establish a theoretical framework for transferable model ensemble adversarial attacks through statistical learning theory.
Abstract: Model ensemble adversarial attacks have become a powerful method for generating transferable adversarial examples that can target even unknown models, but their theoretical foundation remains underexplored. To address this gap, we provide early theoretical insights that serve as a roadmap for advancing model ensemble adversarial attacks. We first define transferability error to measure the error in adversarial transferability, alongside notions of diversity and empirical model ensemble Rademacher complexity. We then decompose the transferability error into vulnerability, diversity, and a constant term, which rigorously explains where transferability error in model ensemble attacks originates: the vulnerability of an adversarial example to the ensemble components, and the diversity of those components. Furthermore, we apply recent mathematical tools from information theory to bound the transferability error with complexity and generalization terms, validating three practical guidelines for reducing it: (1) incorporating more surrogate models, (2) increasing their diversity, and (3) reducing their complexity in cases of overfitting. Finally, extensive experiments with 54 models validate our theoretical framework, representing a significant step toward understanding transferable model ensemble adversarial attacks.
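To make the setting concrete, below is a minimal, generic sketch of the kind of model ensemble adversarial attack the abstract refers to: a PGD-style attack that averages the loss over several surrogate models, corresponding to guideline (1). This is an illustrative reconstruction of the standard ensemble-attack setup, not the authors' specific algorithm; the function name, hyperparameters, and projection choices are placeholders.

```python
# Sketch of a model-ensemble adversarial attack (PGD-style, average loss over
# surrogates). Illustrative only; not the paper's exact method or notation.
import torch
import torch.nn.functional as F

def ensemble_pgd_attack(x, y, surrogate_models, eps=8/255, alpha=2/255, steps=10):
    """Craft an adversarial example by attacking an ensemble of surrogate models,
    hoping it transfers to unseen target models."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        # Average the loss over all surrogate models (more surrogates -> guideline 1).
        loss = sum(F.cross_entropy(m(x_adv), y) for m in surrogate_models) / len(surrogate_models)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()                # ascend the ensemble loss
            x_adv = x + (x_adv - x).clamp(-eps, eps)           # project into the eps-ball
            x_adv = x_adv.clamp(0.0, 1.0)                      # keep a valid image
        x_adv = x_adv.detach()
    return x_adv
```

In this picture, the paper's other guidelines correspond to choosing surrogate models that disagree with each other (diversity) and avoiding surrogates that overfit (complexity), rather than to changes in the attack loop itself.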
Lay Summary: In a transferable adversarial attack, a model ensemble is like a team of students, each learning their own unique ways to cheat on different exams. If this team can collectively discover a general cheating strategy, that strategy can successfully fool new teachers they've never encountered before (i.e., unknown AI models). Our theoretical research reveals that the success of such attacks stems from two main aspects: first, the **vulnerability** of the AI models to the attack itself – the attack needs to be effective enough; second, the **diversity** among the AI models used to launch the attack – they shouldn't be too similar, just as a cheating team benefits from students with varied tricks. Theoretically, an attack becomes stronger if we: 1) incorporate more students (i.e., more AI models); 2) increase the differences in their cheating methods (i.e., enhance model diversity); and 3) reduce how much each student memorizes for cheating (i.e., lower model complexity in cases of overfitting). This provides theoretical support for a field that has largely relied on experimental observations.
Primary Area: Social Aspects->Robustness
Keywords: adversarial transferability, model ensemble attack
Submission Number: 4274