TransferBench: Benchmarking Ensemble-based Black-box Transfer Attacks

Published: 18 Sept 2025, Last Modified: 30 Oct 2025
Venue: NeurIPS 2025 Datasets and Benchmarks Track poster
License: CC BY-NC-ND 4.0
Keywords: Black-box Attacks, Ensemble Transfer Attacks, Surrogate Models, Transferability Evaluation, Query Efficiency, Adversarial Benchmark
TL;DR: We introduce TransferBench, a comprehensive benchmark for evaluating ensemble-based black-box adversarial attacks under realistic scenarios, revealing limitations in surrogate model choices, robustness generalization, and query efficiency.
Abstract: Ensemble-based black-box transfer attacks optimize adversarial examples on a set of surrogate models, claiming to reach high success rates by querying the (unknown) target model only a few times. In this work, we show that prior evaluations are systematically biased, as such methods are tested only under overly optimistic scenarios, without considering (i) how the choice of surrogate models influences transferability, (ii) how they perform against robust target models, and (iii) whether querying the target to refine the attack is really required. To address these gaps, we introduce TransferBench, a framework for evaluating ensemble-based black-box transfer attacks under more realistic and challenging scenarios than prior work. Our framework considers 17 distinct settings on CIFAR-10 and ImageNet, including diverse surrogate-target combinations, robust targets, and comparisons to baseline methods that do not use any query-based refinement mechanism. Our findings reveal that existing methods fail to generalize to more challenging scenarios, and that query-based refinement offers little to no benefit, contradicting prior claims. These results highlight that building reliable and query-efficient black-box transfer attacks remains an open challenge. We release our benchmark and evaluation code at: https://github.com/pralab/transfer-bench.
Code URL: https://github.com/pralab/transfer-bench
Supplementary Material: zip
Primary Area: Dataset and Benchmark for Optimization (e.g., convex and non-convex, stochastic, robust, metrics for optimization, scaling of datasets, benchmarks)
Flagged For Ethics Review: true
Submission Number: 1014
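
Illustrative sketch (not taken from the TransferBench code): the abstract describes attacks that optimize an adversarial example on a set of surrogate models, and baselines that do so without any query-based refinement. The toy Python example below shows one way such a zero-query ensemble baseline could look, assuming binary linear surrogates with a logistic loss and an L-infinity budget. The names `ensemble_pgd`, `mean_loss`, and `logistic_loss_grad`, along with all parameter values, are hypothetical and chosen only for illustration.

```python
import numpy as np

def logistic_loss_grad(w, b, x, y):
    """Gradient w.r.t. x of log(1 + exp(-y * (w.x + b))), with y in {-1, +1}."""
    margin = y * (w @ x + b)
    # d/dx log(1 + exp(-margin)) = -y * w * sigmoid(-margin)
    return -y * w / (1.0 + np.exp(margin))

def mean_loss(x, y, surrogates):
    """Average logistic loss of the surrogate ensemble at input x."""
    return float(np.mean([np.logaddexp(0.0, -y * (w @ x + b)) for w, b in surrogates]))

def ensemble_pgd(x0, y, surrogates, eps=0.1, step=0.02, n_iter=20):
    """Sign-gradient ascent on the averaged surrogate loss (assumed toy setup).

    The perturbation is kept inside an L-infinity ball of radius eps around x0,
    and the input is kept inside the valid range [0, 1]. No target queries are used.
    """
    x = x0.copy()
    for _ in range(n_iter):
        grad = np.mean([logistic_loss_grad(w, b, x, y) for w, b in surrogates], axis=0)
        x = x + step * np.sign(grad)          # ascend the ensemble loss
        x = np.clip(x, x0 - eps, x0 + eps)    # project onto the eps-ball
        x = np.clip(x, 0.0, 1.0)              # keep a valid input
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim = 16
    # Hypothetical surrogates: random linear classifiers (w, b).
    surrogates = [(rng.normal(size=dim), 0.0) for _ in range(3)]
    x0 = rng.uniform(0.0, 1.0, size=dim)
    x_adv = ensemble_pgd(x0, y=1, surrogates=surrogates)
    print("ensemble loss before:", mean_loss(x0, 1, surrogates))
    print("ensemble loss after: ", mean_loss(x_adv, 1, surrogates))
```

In this setup, the example returned by `ensemble_pgd` would be submitted to the unknown target once, with no further refinement. Whether it transfers depends on how closely the surrogates resemble the target, which is the surrogate-choice question the abstract raises.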