Parameter-Efficient Fine-Tuning for Vision-Language Models: The Post-Transformer Evolution

Published: 20 May 2026, Last Modified: 19 Mar 2026, ICIPROB, Everyone, CC BY 4.0
Abstract: The rapid proliferation of large-scale Vision-Language Models (VLMs) has revolutionized multimodal artificial intelligence, enabling unprecedented capabilities in cross-modal understanding and generation. However, the substantial computational and memory requirements of fine-tuning these billion-parameter models present significant deployment challenges, particularly in resource-constrained environments such as mobile robots and edge devices. This survey provides a comprehensive analysis of Parameter-Efficient Fine-Tuning (PEFT) techniques tailored for VLMs in the post-Transformer era (2021-2025). PEFT methods are systematically categorized into three mechanistic paradigms: input-level adaptation (prompting), feature-level adaptation (adapters), and weight-level adaptation (reparameterization). Representative techniques, including CoOp, CoCoOp, MaPLe, CLIP-Adapter, Tip-Adapter, LoRA, DoRA, and PiSSA, are analyzed and evaluated with respect to parameter efficiency, convergence, latency, catastrophic forgetting, and alignment preservation. Comparative benchmarks on ImageNet, VQAv2, and MMBench are used to identify the strategies best suited to distinct application needs. The analysis extends to specialized fields, examining adaptations for medical imaging, remote sensing, and video understanding, as well as privacy-preserving federated learning. The survey concludes that, while methods such as DoRA and Tip-Adapter-F offer competitive performance, the optimal strategy depends on the specific architecture, task complexity, and deployment constraints.