ConvNet vs Transformer, Supervised vs CLIP: Beyond ImageNet Accuracy

16 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Keywords: Computer Vision, Neural Networks, Model Architectures
TL;DR: We explore vision models' distinct behaviors beyond standard accuracy.
Abstract: Modern computer vision offers a great variety of models, and selecting among them for a specific application can be challenging. Conventionally, competing model architectures and training protocols are compared by their ImageNet accuracy. However, this single metric does not fully capture performance nuances critical for specialized tasks. In this work, we conduct an in-depth comparative analysis of model behaviors beyond ImageNet accuracy for four leading models: ConvNeXt and Vision Transformer (ViT), each under supervised and CLIP training objectives. Although the selected models have similar ImageNet accuracies and compute requirements, we find that they differ in many other respects: types of mistakes, output calibration, transferability, and feature invariance, among others. This diversity in model characteristics, not captured by traditional metrics, provides insights for better model selection to meet specific goals. Our study highlights the need for more nuanced analysis when choosing among models.
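The calibration comparison mentioned in the abstract can be reproduced in spirit with a few lines of code. Below is a minimal sketch, assuming the `timm` library and two of its published checkpoint names (`vit_base_patch16_224` and `vit_base_patch16_clip_224.openai_ft_in1k`), of measuring expected calibration error (ECE) for a supervised versus a CLIP-pretrained ViT. This is illustrative only, not the authors' evaluation code, and the random tensors stand in for real ImageNet validation data.

```python
# Illustrative sketch: comparing two ImageNet models on output calibration
# (one of the "beyond accuracy" axes in the abstract) via expected
# calibration error (ECE). Not the authors' code; the model names are
# assumed timm identifiers and may differ across timm versions.
import torch
import timm

def expected_calibration_error(logits: torch.Tensor, labels: torch.Tensor,
                               n_bins: int = 15) -> float:
    """Bin predictions by confidence and average |accuracy - confidence|,
    weighted by the fraction of samples falling in each bin."""
    probs = logits.softmax(dim=-1)
    conf, pred = probs.max(dim=-1)
    correct = pred.eq(labels).float()
    edges = torch.linspace(0.0, 1.0, n_bins + 1)
    ece = torch.zeros(())
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            gap = (correct[in_bin].mean() - conf[in_bin].mean()).abs()
            ece += in_bin.float().mean() * gap
    return ece.item()

# Supervised vs. CLIP-pretrained ViT (assumed timm checkpoint names).
models = {
    "vit_supervised": "vit_base_patch16_224",
    "vit_clip": "vit_base_patch16_clip_224.openai_ft_in1k",
}

# Random tensors stand in for ImageNet validation images and labels;
# substitute a real evaluation loader to obtain meaningful numbers.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 1000, (8,))

for name, ckpt in models.items():
    model = timm.create_model(ckpt, pretrained=True).eval()
    with torch.no_grad():
        logits = model(images)
    print(f"{name}: ECE = {expected_calibration_error(logits, labels):.4f}")
```

The same loop extends naturally to the other axes the paper studies, e.g. swapping the ECE metric for a transfer-learning probe or an invariance test over augmented inputs.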
Primary Area: visualization or interpretation of learned representations
Submission Number: 761