Beyond Adversarial Robustness: Breaking the Robustness-Alignment Trade-off in Object Recognition

Published: 06 Mar 2025, Last Modified: 02 May 2025
Venue: ICLR 2025 Re-Align Workshop Poster
License: CC BY 4.0
Track: long paper (up to 10 pages)
Domain: machine learning
Abstract: A well-known limitation of deep neural networks (DNNs) is their sensitivity to adversarial attacks. That DNNs can easily be fooled by minute image perturbations imperceptible to humans has long been considered a significant vulnerability of deep learning, which may eventually force a shift towards modeling paradigms that are faithful to biology. Nevertheless, the ever-evolving capabilities of DNNs have largely eclipsed these early concerns. Do adversarial perturbations continue to pose a threat to DNNs? Here, we investigate whether DNN improvements in image categorization have led to concurrent improvements in robustness to adversarial perturbations. We evaluated DNN adversarial robustness in two ways. First, we measured the tolerance of DNNs to adversarial perturbations by recording the norm of the smallest image perturbation needed to change a model's decision, using a standard "minimum-norm" robustness metric. Second, we measured the alignment of perturbations, i.e., the degree to which they target pixels that are diagnostic for human observers. We uncover a surprising trade-off: as DNNs have improved on ImageNet, they have grown more tolerant to adversarial perturbations. However, these perturbations are also progressively less aligned with features critical to humans for object recognition. To better understand the source of this trade-off, we turn to DNN training methods that have previously been reported to align DNNs with human vision, namely adversarial training and harmonization. Our results show that both methods improve this trade-off, significantly increasing the tolerance and alignment of DNN perturbations with human visual features. Harmonized models, unlike adversarially trained ones, are also able to maintain their ImageNet accuracy in the process. Our findings suggest that the vulnerability of DNNs to adversarial perturbations can be at least partially mitigated by augmenting the trends in model scaling that are driving development today with training routines that align models with biological intelligence. We release our code and data to support continued progress toward studying the adversarial behavior of DNNs.
Submission Number: 50
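
The abstract's first measurement, a "minimum-norm" robustness score, can be illustrated with a small sketch: search for the smallest-magnitude perturbation that flips a model's decision and report its L2 norm. The snippet below is an assumption-laden simplification (a bisection over a single gradient direction in PyTorch), not the paper's exact attack or metric.

```python
# Minimal sketch (not the paper's exact procedure): estimate a "minimum-norm"
# robustness score as the L2 norm of the smallest gradient-direction
# perturbation that changes the model's predicted class. The bisection scheme
# and parameter values are illustrative assumptions.
import torch
import torch.nn.functional as F

def min_norm_along_gradient(model, x, label, max_eps=10.0, steps=20):
    """Binary-search the perturbation magnitude along the loss-gradient
    direction; return the approximate minimum L2 norm that flips the decision."""
    model.eval()
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), label)
    loss.backward()
    # Unit-norm adversarial direction per image.
    grad_norm = x.grad.flatten(1).norm(dim=1).view(-1, 1, 1, 1) + 1e-12
    direction = x.grad / grad_norm

    lo, hi = 0.0, max_eps
    for _ in range(steps):
        mid = (lo + hi) / 2
        with torch.no_grad():
            pred = model(x + mid * direction).argmax(dim=1)
        if (pred != label).all():   # decision flipped -> try a smaller norm
            hi = mid
        else:                       # still correct -> need a larger norm
            lo = mid
    return hi  # approximate minimum-norm tolerance of the model on x
```

Under this reading, a more tolerant model yields a larger returned norm; the paper's second measure would additionally compare where that perturbation concentrates against pixels diagnostic for human observers.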
