Transformed CNNs: recasting pre-trained convolutional layers with self-attention

Published: 28 Jan 2022, Last Modified: 13 Feb 2023 (ICLR 2022 Submission)
Keywords: convolutional networks, transformers, hybrid, fine-tuning
Abstract: Vision Transformers (ViT) have recently emerged as a powerful alternative to convolutional networks (CNNs). Although hybrid models attempt to bridge the gap between these two architectures, the self-attention layers they rely on induce a strong computational bottleneck, especially at large spatial resolutions. In this work, we explore the idea of reducing the time spent training these layers by initializing them from pre-trained convolutional layers. This enables us to transition smoothly from any pre-trained CNN to its functionally identical hybrid model, called Transformed CNN (T-CNN). With only 50 epochs of fine-tuning, the resulting T-CNNs demonstrate significant performance gains over the original CNN as well as substantially improved robustness. We analyze the representations learnt by the T-CNN, providing deeper insights into the fruitful interplay between convolutions and self-attention.
One-sentence Summary: We reparametrize pre-trained convolutional layers as self-attention layers to improve their robustness.
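
The reparametrization the abstract describes rests on the fact that multi-head self-attention with suitable positional attention can express any convolution (Cordonnier et al., 2020): with k² heads, each head attends to one relative offset of a k × k kernel, and the head's output projection carries the corresponding slice of the convolution filter, so the hybrid starts out functionally identical to the CNN. Below is a minimal sketch of this construction, assuming PyTorch; the explicit one-hot attention matrices and the function name `conv_as_attention` are illustrative choices of ours, not the paper's implementation.

```python
# Minimal sketch (assumptions: PyTorch; stride-1 "same"-padded conv, no bias):
# express a pre-trained k x k convolution as k*k hard-coded attention heads,
# following the construction of Cordonnier et al. (2020).
import torch

def conv_as_attention(conv, x):
    """Apply a pre-trained Conv2d as k*k one-hot self-attention heads.

    conv: nn.Conv2d with stride 1, padding k//2, bias=False.
    x:    input of shape (B, C_in, H, W).
    Head h attends deterministically to the pixel at relative offset
    delta_h; its output projection is the conv filter at that offset,
    so the result equals conv(x).
    """
    B, C, H, W = x.shape
    k = conv.kernel_size[0]
    tokens = x.flatten(2).transpose(1, 2)          # (B, H*W, C_in)

    # Absolute grid coordinates of each of the H*W tokens (row-major).
    rows = torch.arange(H).repeat_interleave(W)
    cols = torch.arange(W).repeat(H)

    out = torch.zeros(B, H * W, conv.out_channels)
    offsets = [(i - k // 2, j - k // 2) for i in range(k) for j in range(k)]
    for di, dj in offsets:
        # One-hot attention: token (r, c) attends to token (r+di, c+dj);
        # out-of-grid targets contribute nothing, mimicking zero padding.
        tr, tc = rows + di, cols + dj
        valid = (tr >= 0) & (tr < H) & (tc >= 0) & (tc < W)
        attn = torch.zeros(H * W, H * W)
        idx = torch.arange(H * W)[valid]
        attn[idx, tr[valid] * W + tc[valid]] = 1.0

        # This head's output projection: the filter at offset (di, dj).
        w_h = conv.weight[:, :, di + k // 2, dj + k // 2]  # (C_out, C_in)
        out += attn @ tokens @ w_h.T                       # broadcast over B

    return out.transpose(1, 2).reshape(B, conv.out_channels, H, W)

# Sanity check: the attention form reproduces the convolution exactly.
conv = torch.nn.Conv2d(8, 16, kernel_size=3, padding=1, bias=False)
x = torch.randn(2, 8, 10, 10)
assert torch.allclose(conv_as_attention(conv, x), conv(x), atol=1e-5)
```

From this functionally identical initialization, the fine-tuning stage lets the attention maps deviate from the hard convolutional pattern, which is where the performance and robustness gains reported in the abstract come from.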