How to Adapt Your Large-Scale Vision-and-Language Model

Published: 28 Jan 2022, Last Modified: 13 Feb 2023 · ICLR 2022 Submitted
Keywords: transfer learning, fine-tuning, layernorm, CLIP, prompt-tuning, adaptation, zero-shot, pretraining
Abstract: Pre-training large-scale vision and language models (e.g., CLIP) has shown promising results in representation and transfer learning. We investigate how to efficiently adapt these models to downstream tasks. For image classification, linear probes have been the standard for ease of use and efficiency, while for language, other approaches such as prompt tuning have emerged. We analyze several fine-tuning methods on a diverse set of image classification tasks along two spectra: the amount of downstream data and its similarity to the pretraining data. We find that tuning only LayerNorm parameters is a surprisingly effective baseline across the board. We further demonstrate a simple yet effective strategy that combines LayerNorm-tuning with general fine-tuning methods to improve their performance, and benchmark them on few-shot adaptation and distribution-shift tasks. Finally, we provide an empirical analysis and recommend general recipes for efficient transfer learning of vision and language models. Website at https://sites.google.com/view/adapt-large-scale-models
One-sentence Summary: We present a thorough analysis of methods for adapting large-scale pretrained vision-and-language models to several downstream classification tasks, and find that tuning only LayerNorm parameters is an effective fine-tuning baseline.
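
To make the LayerNorm-tuning baseline described in the abstract concrete, the sketch below shows one common way to implement it in PyTorch: freeze every parameter of a pretrained encoder and re-enable gradients only for LayerNorm affine parameters. This is a minimal illustration under assumed settings (the stand-in encoder, learning rate, and helper name are ours, not the authors' exact setup).

```python
# Minimal sketch (not the authors' exact code): LayerNorm-only fine-tuning.
# The encoder, helper name, and hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn


def freeze_all_but_layernorm(model: nn.Module) -> list:
    """Freeze every parameter except LayerNorm affine weights/biases.

    Returns the parameters that remain trainable, so they can be
    handed to an optimizer.
    """
    # Start with everything frozen.
    for p in model.parameters():
        p.requires_grad = False

    trainable = []
    for module in model.modules():
        if isinstance(module, nn.LayerNorm):
            # Re-enable the affine parameters (weight = gain, bias = shift).
            for p in module.parameters():
                p.requires_grad = True
                trainable.append(p)
    return trainable


if __name__ == "__main__":
    # Stand-in for a pretrained vision-and-language tower; in practice this
    # would be, e.g., a CLIP image or text encoder loaded from a checkpoint.
    encoder = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
        num_layers=4,
    )
    ln_params = freeze_all_but_layernorm(encoder)
    optimizer = torch.optim.AdamW(ln_params, lr=1e-4)  # illustrative LR

    total = sum(p.numel() for p in encoder.parameters())
    tuned = sum(p.numel() for p in ln_params)
    print(f"Tuning {tuned} of {total} parameters ({100 * tuned / total:.2f}%)")
```

Because only the LayerNorm gains and shifts are updated, the number of trainable parameters is a small fraction of the full model, which is what makes this baseline cheap to run and easy to combine with other fine-tuning methods.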