Keywords: fine-tuning, SGD, freezing layers, distribution shift
TL;DR: SGD can do worse than AdamW under distribution shifts, but simple changes make SGD competitive
Abstract: SGD (with momentum) and AdamW are the two most commonly used optimizers for fine-tuning large neural networks in computer vision. When the two methods perform the same, SGD is preferable because it uses less memory and is more efficient than AdamW. However, when evaluating on downstream tasks that differ significantly from pretraining, we find that across five popular benchmarks SGD fine-tuning gets substantially lower accuracies than AdamW on many modern vision models such as Vision Transformers and ConvNeXts---especially out-of-distribution (OOD). We find that such large gaps arise in instances where the fine-tuning gradients in the first (``embedding'') layer are much larger than the rest of the model. Our analysis suggests an easy fix: if we simply freeze the embedding layer (0.7\% of the parameters), SGD performs competitively with AdamW while using less memory across a suite of benchmarks. Our insights lead to state-of-the-art accuracies on popular distribution shift benchmarks: WILDS-FMoW, WILDS-Camelyon, BREEDS-Living-17, Waterbirds, and DomainNet.
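A minimal sketch of the fix described in the abstract: freeze the first ("embedding") layer and fine-tune the rest with SGD. The parameter-name prefixes (`patch_embed`, `pos_embed`, `cls_token`) are an assumption following timm-style Vision Transformers, and the learning rate and momentum values are illustrative, not the paper's settings.

```python
import torch
import torch.nn as nn

# Assumed embedding-layer parameter name prefixes (timm-style ViT naming);
# adjust these for your architecture (e.g., the stem of a ConvNeXt).
EMBEDDING_PARAM_PREFIXES = ("patch_embed", "pos_embed", "cls_token")


def freeze_embedding_layer(model: nn.Module) -> None:
    """Disable gradients for the first ("embedding") layer parameters."""
    for name, param in model.named_parameters():
        if name.startswith(EMBEDDING_PARAM_PREFIXES):
            param.requires_grad = False


def build_sgd_optimizer(model: nn.Module, lr: float = 1e-3, momentum: float = 0.9):
    # Pass only the still-trainable parameters to SGD (illustrative hyperparameters).
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.SGD(trainable, lr=lr, momentum=momentum)
```

Because the frozen embedding parameters are excluded from the optimizer, SGD keeps its memory advantage over AdamW (no per-parameter second-moment state) while avoiding the large embedding-layer gradients the abstract identifies as the source of the gap.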
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Deep Learning and representational learning