Keywords: Deep neural network, convolutional neural network, vision transformer, multi-layer perceptron, image classification
TL;DR: This paper presents a simple and universal head structure to improve the representation learning of deep neural networks for image classification.
Abstract: A modern deep neural network (DNN) for image classification typically consists of two parts: a backbone for feature extraction, and a head for feature encoding and class prediction. We notice that the head structures of prevailing DNNs share a similar processing pipeline, exploiting global feature dependencies while disregarding local ones. Instead, this paper presents Non-glObal Attentive Head (NOAH), a simple and universal head structure, to improve the learning capacity of DNNs. NOAH relies on a novel form of attention dubbed pairwise object category attention, which models dense local-to-global feature dependencies via a concise association of feature split, interaction and aggregation operations. As a drop-in design, NOAH can replace the existing heads of many DNNs while maintaining almost the same model size and similar model efficiency. We validate the efficacy of NOAH mainly on the large-scale ImageNet dataset with various DNN architectures, spanning convolutional neural networks, vision transformers and multi-layer perceptrons, all trained from scratch. Without bells and whistles, experiments show that: (a) NOAH can significantly boost the performance of lightweight DNNs, e.g., bringing $3.14\%$|$5.30\%$|$1.90\%$ top-1 accuracy improvement for MobileNetV2 ($0.5\times$)|DeiT-Tiny ($0.5\times$)|gMLP-Tiny ($0.5\times$); (b) NOAH generalizes well to relatively large DNNs, e.g., bringing $1.02\%$|$0.78\%$|$0.91\%$ top-1 accuracy improvement for ResNet50|DeiT-Small|MLP-Mixer-Small; (c) NOAH still brings acceptable performance gains to large DNNs (having over 50 million parameters), e.g., $0.41\%$|$0.37\%$|$0.35\%$ top-1 accuracy improvement for ResNet152|DeiT-Base|MLP-Mixer-Base. Besides, NOAH also retains its effectiveness under an aggressive training regime (e.g., a ResNet50 model with NOAH reaches $79.32\%$ top-1 accuracy, yielding a $0.88\%$ gain) and on other image classification tasks. Code is provided for reproducing the results.
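The abstract describes NOAH only at a high level: features are split into local groups, each group interacts with per-class attention, and the results are aggregated into class logits. The sketch below is a speculative NumPy illustration of such a split–interact–aggregate head; the split count, the linear projections `w_att`/`w_cls`, and the aggregation rule are all assumptions, not the paper's actual pairwise object category attention.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def split_interact_aggregate_head(feat, num_splits, num_classes, rng):
    """Illustrative head: split channels, attend per class, aggregate.

    feat: (H*W, C) flattened spatial feature map from a backbone.
    All projection weights are random here, standing in for learned
    parameters; this is a guess at the pipeline, not NOAH itself.
    """
    hw, c = feat.shape
    assert c % num_splits == 0, "channels must divide evenly into splits"
    group = c // num_splits
    logits = np.zeros(num_classes)
    for s in range(num_splits):
        part = feat[:, s * group:(s + 1) * group]           # feature split
        w_att = rng.standard_normal((group, num_classes)) * 0.01
        att = softmax(part @ w_att, axis=0)                 # per-class spatial attention (interaction)
        w_cls = rng.standard_normal((group, num_classes)) * 0.01
        score = part @ w_cls                                # per-location class scores
        logits += (att * score).sum(axis=0)                 # attentive aggregation over locations
    return logits
```

Because each split attends over spatial locations independently before aggregation, local feature groups contribute their own attention maps rather than one global one, which is the flavor of "local-to-global dependency" the abstract alludes to.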
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Deep Learning and representational learning
Supplementary Material: zip