Keywords: similarity, representational similarity, functional similarity, adversarial robustness, universality
Abstract: The *modified universality hypothesis* proposed by Jones et al. (2022) suggests that adversarially robust models trained for a given task are highly similar. We revisit the hypothesis and test its generality. We find that predictive behavior does not converge with increasing robustness and thus is not universal. Further, with additional similarity measures, we uncover differences in the representations that were invisible with the measures used in prior work. While robust models tend to be more similar than standard models, they remain distinct in important aspects. Moreover, our results highlight the importance of the choice of similarity measure when comparing representations: the absolute level of similarity, and thus the assessment of universality, depends heavily on the measure used.
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Submission Number: 13901