A comprehensive survey of Vision-Language Models: Pretrained models, fine-tuning, prompt engineering, adapters, and benchmark datasets

Published: 01 Jan 2026 · Last Modified: 07 Nov 2025 · Information Fusion 2026 · License: CC BY-SA 4.0
Abstract:
Highlights
• Comprehensive analysis of VLM components such as tuning, prompts, and datasets.
• Explores adapter-based tuning and low-resource learning for efficiency gains.
• Covers advances in contrastive pre-training and prompt engineering methods.
• Discusses challenges in benchmarking, data diversity, and annotation quality.
• Highlights future VLM challenges, including ethics, scaling, and adaptation.
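To make the contrastive pre-training mentioned in the highlights concrete, the sketch below shows a CLIP-style objective: image and text embeddings are L2-normalized and aligned with a symmetric cross-entropy over cosine-similarity logits. This is a minimal illustration, not code from the survey; the function name `clip_contrastive_loss` and the temperature value are illustrative assumptions.

```python
# Minimal sketch of a CLIP-style contrastive pre-training loss (illustrative only).
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """image_emb, text_emb: (batch, dim) outputs of the image and text encoders."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (batch, batch) cosine similarities
    targets = torch.arange(logits.size(0))            # matching image-text pairs lie on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)       # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# Example usage with random tensors standing in for encoder outputs
if __name__ == "__main__":
    imgs, txts = torch.randn(8, 512), torch.randn(8, 512)
    print(clip_contrastive_loss(imgs, txts).item())
```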