Are Vision Transformers Always More Robust Than Convolutional Neural Networks?

09 Oct 2021, 14:49 (modified: 01 Dec 2021, 23:38) — NeurIPS 2021 Workshop DistShift Poster
Keywords: visual transformers, big transfer, transfer learning, data-shift, out-of-distribution detection, calibration
TL;DR: Evidence that convolutional inductive biases suffice for robustness to data-shift and for good OOD detection performance. We compare the effects of fine-tuning on uncertainty properties and robustness.
Abstract: Since Transformer architectures were popularised in Computer Vision, several papers have analysed their properties in terms of calibration, out-of-distribution detection and data-shift robustness. Most of these papers conclude that Transformers outperform Convolutional Neural Networks (CNNs) due to some intrinsic properties, presumably the lack of restrictive inductive biases and the computationally intensive self-attention mechanism. In this paper, we question this conclusion: in some relevant cases, CNNs pre-trained and fine-tuned with a procedure similar to the one used for Transformers exhibit competitive robustness. Our evidence suggests that, to fully understand this behaviour, researchers should focus on the interaction between pre-training, fine-tuning and the considered architectures rather than on intrinsic properties of Transformers. To this end, we present some preliminary analyses that shed light on the impact of pre-training and fine-tuning on out-of-distribution detection and data-shift robustness.