Efficient Adaptation of Large Vision-Language Models: Transfer Learning Methods and Applications

TMLR Paper6905 Authors

08 Jan 2026 (modified: 22 May 2026), Rejected by TMLR, CC BY 4.0
Abstract: Pre-trained large vision-language models (VLMs) have become the dominant choice for vision-language tasks, ranging from multimodal reasoning to text-image generation. However, these models depend heavily on large-scale training datasets, composed primarily of image-text pairs sourced from the web, which typically cover general domains rather than specific downstream tasks. Given the scarcity of data in such specialized domains, transfer learning emerges as a remedy: it adapts a model's preexisting knowledge to new tasks with limited data, thereby reducing the reliance on extensive datasets. Following the current trend of applying transfer learning to vision-language tasks, we provide a systematic study of existing transfer learning techniques adopted for vision-language models, including: (1) a summary of existing state-of-the-art VLMs, (2) a comprehensive taxonomy of transfer learning approaches for VLMs, (3) a discussion of real-world applications of transfer learning methods for VLMs, and (4) a summary of commonly used vision-language datasets and benchmarks across various vision-language tasks.
Submission Type: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Massimiliano_Mancini1
Submission Number: 6905