Towards Reliable Transferability of Targeted Adversarial Attacks against Model Discrepancy

20 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Targeted Adversarial Attacks, Transferability
Abstract: Adversarial attacks pose a serious threat to deep neural networks, especially in black-box scenarios where transferability plays a key role. Targeted transfer attacks, in which an attacker induces a specific misclassification on an unseen black-box model, remain significantly more challenging than non-targeted attacks. We attribute this gap to model discrepancies between surrogate and target models, including mismatches in feature representations, classifier heads, and Jacobians. To address these challenges, we define a unified uncertainty set capturing these model discrepancies and propose a principled robust objective over this set. While intractable in full form, this view leads to a tractable relaxation: the Targeted Attack toward Reliable Transferability (TART). TART integrates three components: (1) expectation over transforms, to cover representation and Jacobian variability; (2) latent mixing, to model attenuation and clean-feature leakage; and (3) feature matching, to guide perturbations toward semantically robust regions. Extensive experiments on ImageNet and CIFAR-10 show that TART consistently outperforms state-of-the-art transfer-based black-box targeted attacks across both convolutional and transformer architectures. For example, when transferring from ResNet-50 to Swin-S on ImageNet, TART achieves a 42.7% higher attack success rate than the strongest baseline. Our approach establishes a new benchmark for robust black-box adversarial evaluation.
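Since the abstract lists TART's three components without further detail, here is a minimal NumPy sketch of how such a robust objective could be assembled. All names, the additive-jitter transform, and the toy linear feature map are illustrative assumptions, not the paper's actual implementation:

```python
# Illustrative sketch of TART's three components (expectation over
# transforms, latent mixing, feature matching) on a toy linear
# "feature extractor". Everything here is an assumption for
# illustration; the paper's formulation may differ.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))      # toy stand-in for a surrogate's feature map

def features(x):
    return W @ x                      # deep features replaced by a linear map

def random_transform(x, rng):
    # (1) expectation over transforms: here, small additive input jitter
    return x + 0.01 * rng.standard_normal(x.shape)

def tart_loss(x_adv, x_clean, target_feat, n_transforms=4, alpha=0.7, rng=rng):
    losses = []
    for _ in range(n_transforms):
        f_adv = features(random_transform(x_adv, rng))
        # (2) latent mixing: blend adversarial and clean features to model
        # attenuation and clean-feature leakage on the unseen target model
        f_mix = alpha * f_adv + (1.0 - alpha) * features(x_clean)
        # (3) feature matching: pull mixed features toward a target-class
        # feature anchor (here, a random vector standing in for a centroid)
        losses.append(np.sum((f_mix - target_feat) ** 2))
    return float(np.mean(losses))     # Monte Carlo expectation over transforms

x_clean = rng.standard_normal(16)
x_adv = x_clean + 0.05 * rng.standard_normal(16)
target_feat = rng.standard_normal(8)
loss = tart_loss(x_adv, x_clean, target_feat)
```

In a real attack, `tart_loss` would be minimized over `x_adv` by projected gradient descent under an L-infinity budget; the averaging over transforms is what gives the objective its robustness to the uncertainty set described above.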
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 23877