A comprehensive survey of Vision-Language Models: Pretrained models, fine-tuning, prompt engineering, adapters, and benchmark datasets

Published: 01 Jan 2026 · Last Modified: 07 Nov 2025 · Information Fusion 2026 · License: CC BY-SA 4.0
Abstract:
Highlights
• Comprehensive analysis of VLM components such as tuning, prompts, and datasets.
• Explores adapter-based tuning and low-resource learning for efficiency gains.
• Covers advances in contrastive pre-training and prompt engineering methods.
• Discusses challenges in benchmarking, data diversity, and annotation quality.
• Highlights future VLM challenges, including ethics, scaling, and adaptation.
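To make the contrastive pre-training mentioned in the highlights concrete, the sketch below shows a CLIP-style objective: image and text embeddings are L2-normalized and aligned with a symmetric cross-entropy over cosine-similarity logits. This is a minimal illustration, not code from the survey; the function name `clip_contrastive_loss` and the temperature value are illustrative assumptions.

```python
# Minimal sketch of a CLIP-style contrastive pre-training loss (illustrative only).
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """image_emb, text_emb: (batch, dim) outputs of the image and text encoders."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (batch, batch) cosine similarities
    targets = torch.arange(logits.size(0))            # matching image-text pairs lie on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)       # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# Example usage with random tensors standing in for encoder outputs
if __name__ == "__main__":
    imgs, txts = torch.randn(8, 512), torch.randn(8, 512)
    print(clip_contrastive_loss(imgs, txts).item())
```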