With Great Backbones Comes Great Adversarial Transferability

ICLR 2026 Conference Submission 18973 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: adversarial, attack, security, evaluation, robustness, transferability, safety
TL;DR: SSL pre-trained backbones like ResNet and ViT face overlooked adversarial risks. This study reveals that minimal knowledge of the tuning configuration enables near-white-box attacks, exposing vulnerabilities in model-sharing practices.
Abstract: Advancements in self-supervised learning (SSL) for machine vision have enhanced representation robustness and model performance, leading to the emergence of publicly shared pre-trained backbones, such as $\text{\emph{ResNet}}$ and $\text{\emph{ViT}}$ models tuned with SSL methods like $\text{\emph{SimCLR}}$. Due to the computational and data demands of pre-training, using such backbones becomes a practical necessity. However, employing a shared backbone may also mean inheriting its existing vulnerabilities to adversarial attacks. Prior research on adversarial robustness typically examines attacks with either full ($\text{\emph{white-box}}$) or no direct access ($\text{\emph{black-box}}$) to the target model, but the adversarial robustness of models tuned on known pre-trained backbones remains largely unexplored. Furthermore, it is unclear which tuning configurations are critical for mitigating exploitation risks. In this work, we systematically study the adversarial robustness of models that use such backbones, evaluating $20,000$ combinations of tuning configurations, including fine-tuning techniques, backbone families, datasets, and attack types. To uncover and exploit vulnerabilities, we propose using proxy models to transfer adversarial attacks, fine-tuning them with various configurations to simulate different levels of knowledge about the target. Our findings show that proxy-based attacks can outperform strong query-based $\text{\emph{black-box}}$ methods with sizable budgets, approaching the effectiveness of $\text{\emph{white-box}}$ methods. Critically, we construct a naive $\text{``backbone attack"}$, leveraging only the shared backbone, and show that even it consistently surpasses $\text{\emph{black-box}}$ attacks and closes in on $\text{\emph{white-box}}$ effectiveness, thus exposing critical risks in model-sharing practices. Finally, our ablations reveal how knowledge of the tuning configuration impacts attack transferability.
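To make the threat model concrete, the sketch below shows one plausible instantiation of a "backbone attack": the adversary has access only to the shared pre-trained backbone (not the fine-tuned target or its head) and runs PGD in feature space, pushing the perturbed input's representation away from the clean one before transferring the result to the unseen target. The loss, step sizes, and budget here are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of a feature-space "backbone attack" (assumed formulation).
# Only the shared backbone is used; the fine-tuned target model is never queried.
import torch
import torch.nn.functional as F


def backbone_attack(backbone: torch.nn.Module,
                    x: torch.Tensor,
                    eps: float = 8 / 255,
                    alpha: float = 2 / 255,
                    steps: int = 10) -> torch.Tensor:
    """PGD under an L-inf budget that maximises feature-space drift.

    `backbone` is assumed to map images in [0, 1] to feature embeddings.
    """
    backbone.eval()
    with torch.no_grad():
        clean_feat = backbone(x)  # reference features of the clean input

    # Random start inside the L-inf ball, as in standard PGD.
    delta = torch.zeros_like(x).uniform_(-eps, eps).requires_grad_(True)

    for _ in range(steps):
        adv_feat = backbone(torch.clamp(x + delta, 0, 1))
        # Negative cosine similarity: ascending this loss pushes the
        # adversarial features away from the clean representation.
        loss = -F.cosine_similarity(adv_feat.flatten(1),
                                    clean_feat.flatten(1), dim=1).mean()
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()  # gradient-ascent step
            delta.clamp_(-eps, eps)             # project back into the budget
            delta.grad.zero_()

    return torch.clamp(x + delta, 0, 1).detach()
```

In use, the returned adversarial images would then be fed to the downstream fine-tuned model to measure transfer success; stronger variants could additionally exploit partial knowledge of the tuning configuration via proxy models, as studied in the paper.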
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 18973