Revisiting the Relation Between Robustness and Universality

Published: 06 Mar 2025, Last Modified: 20 Apr 2025ICLR 2025 Re-Align Workshop PosterEveryoneRevisionsBibTeXCC BY 4.0
Track: long paper (up to 10 pages)
Domain: machine learning
Abstract: The _modified universality hypothesis_ proposed by Jones et al. (2022) suggests that adversarially robust models trained for a given task are highly similar. We revisit the hypothesis and test its generality. While we verify Jones' main claim of high representational similarity in specific settings, results are not consistent across different datasets. We also discover that predictive behavior does not converge with increasing robustness and thus is not universal. We find that differing predictions originate in the classification layer, but show that more universal predictive behavior can be achieved with simple retraining of the classifiers. Overall, our work points towards partial universality of neural networks in specific settings and away from notions of strict universality.
Submission Number: 19
Loading