Transformed CNNs: recasting pre-trained convolutional layers with self-attention

Published: 28 Jan 2022, Last Modified: 13 Feb 2023 (ICLR 2022 Submission)
Keywords: convolutional networks, transformers, hybrid, fine-tuning
Abstract: Vision Transformers (ViT) have recently emerged as a powerful alternative to convolutional networks (CNNs). Although hybrid models attempt to bridge the gap between these two architectures, the self-attention layers they rely on induce a strong computational bottleneck, especially at large spatial resolutions. In this work, we explore the idea of reducing the time spent training these layers by initializing them from pre-trained convolutional layers. This enables us to transition smoothly from any pre-trained CNN to its functionally identical hybrid model, called Transformed CNN (T-CNN). With only 50 epochs of fine-tuning, the resulting T-CNNs demonstrate significant performance gains over the original CNN as well as substantially improved robustness. We analyze the representations learnt by the T-CNN, providing deeper insights into the fruitful interplay between convolutions and self-attention.
One-sentence Summary: We reparametrize pre-trained convolutional layers as self-attention layers to improve their robustness.
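
The reparametrization the abstract describes rests on the fact that multi-head self-attention with suitable positional attention can express any convolution (Cordonnier et al., 2020): with k² heads, each head attends to one relative offset of a k × k kernel, and the head's output projection carries the corresponding slice of the convolution filter, so the hybrid starts out functionally identical to the CNN. Below is a minimal sketch of this construction, assuming PyTorch; the explicit one-hot attention matrices and the function name `conv_as_attention` are illustrative choices of ours, not the paper's implementation.

```python
# Minimal sketch (assumptions: PyTorch; stride-1 "same"-padded conv, no bias):
# express a pre-trained k x k convolution as k*k hard-coded attention heads,
# following the construction of Cordonnier et al. (2020).
import torch

def conv_as_attention(conv, x):
    """Apply a pre-trained Conv2d as k*k one-hot self-attention heads.

    conv: nn.Conv2d with stride 1, padding k//2, bias=False.
    x:    input of shape (B, C_in, H, W).
    Head h attends deterministically to the pixel at relative offset
    delta_h; its output projection is the conv filter at that offset,
    so the result equals conv(x).
    """
    B, C, H, W = x.shape
    k = conv.kernel_size[0]
    tokens = x.flatten(2).transpose(1, 2)          # (B, H*W, C_in)

    # Absolute grid coordinates of each of the H*W tokens (row-major).
    rows = torch.arange(H).repeat_interleave(W)
    cols = torch.arange(W).repeat(H)

    out = torch.zeros(B, H * W, conv.out_channels)
    offsets = [(i - k // 2, j - k // 2) for i in range(k) for j in range(k)]
    for di, dj in offsets:
        # One-hot attention: token (r, c) attends to token (r+di, c+dj);
        # out-of-grid targets contribute nothing, mimicking zero padding.
        tr, tc = rows + di, cols + dj
        valid = (tr >= 0) & (tr < H) & (tc >= 0) & (tc < W)
        attn = torch.zeros(H * W, H * W)
        idx = torch.arange(H * W)[valid]
        attn[idx, tr[valid] * W + tc[valid]] = 1.0

        # This head's output projection: the filter at offset (di, dj).
        w_h = conv.weight[:, :, di + k // 2, dj + k // 2]  # (C_out, C_in)
        out += attn @ tokens @ w_h.T                       # broadcast over B

    return out.transpose(1, 2).reshape(B, conv.out_channels, H, W)

# Sanity check: the attention form reproduces the convolution exactly.
conv = torch.nn.Conv2d(8, 16, kernel_size=3, padding=1, bias=False)
x = torch.randn(2, 8, 10, 10)
assert torch.allclose(conv_as_attention(conv, x), conv(x), atol=1e-5)
```

From this functionally identical initialization, the fine-tuning stage lets the attention maps deviate from the hard convolutional pattern, which is where the performance and robustness gains reported in the abstract come from.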