Which Backbone to Use: A Resource-efficient Domain Specific Comparison for Computer Vision

Published: 18 Mar 2025, Last Modified: 18 Mar 2025. Accepted by TMLR. License: CC BY 4.0
Abstract: For computer vision applications on small, niche, and proprietary datasets, fine-tuning a neural network (NN) backbone that is pre-trained on a large dataset, such as ImageNet, is common practice. However, it is unknown whether the backbones that perform well on large datasets, such as vision transformers, are also the right choice for fine-tuning on smaller custom datasets. This comprehensive analysis aims to aid machine learning practitioners in selecting the most suitable backbone for their specific problem. We systematically evaluated multiple lightweight, pre-trained backbones under consistent training settings across a variety of domains spanning natural, medical, deep space, and remote sensing images. We found that even though attention-based architectures are gaining popularity, they tend to perform poorly compared to CNNs when fine-tuned on small amounts of domain-specific data. We also observed that certain CNN architectures consistently perform better than others when controlled for network size. Our findings provide actionable insights into the performance trade-offs and effectiveness of different backbones for a broad spectrum of computer vision domains.
Submission Length: Regular submission (no more than 12 pages of main content)
Code: https://github.com/pranavphoenix/Backbones
Assigned Action Editor: ~Yu-Xiong_Wang1
Submission Number: 3611
