ViFT: Towards Visual Instruction-Free Fine-tuning for Large Vision-Language Models

ACL ARR 2025 May Submission 2989 Authors

19 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Visual instruction tuning has become the predominant technology for eliciting the multimodal task-solving capabilities of large vision-language models (LVLMs). Despite this success, because visual instructions require images as input, they leave a gap in inheriting the task-solving capabilities of the backbone LLMs and make it costly to collect a large-scale, high-quality dataset. To address this, we propose ViFT, a visual instruction-free fine-tuning framework for LVLMs. In ViFT, we require only text-only instructions and image caption data during training, to separately learn the task-solving and visual perception abilities. During inference, we extract and combine the representations of the text and image inputs, fusing the two abilities to fulfill multimodal tasks. Experimental results demonstrate that ViFT achieves state-of-the-art performance on several downstream benchmarks with considerably less training data. Our code and data will be publicly released.
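The abstract only states that representations of the text and image inputs are extracted and combined at inference time; it does not specify the fusion rule. The snippet below is a minimal, hypothetical sketch of one such representation-level fusion, assuming a simple additive (steering-style) combination of decoder hidden states with an illustrative coefficient `alpha` — these names and the weighting scheme are assumptions, not the authors' exact formulation.

```python
import torch

def fuse_representations(text_hidden: torch.Tensor,
                         multimodal_hidden: torch.Tensor,
                         alpha: float = 0.5) -> torch.Tensor:
    """Combine hidden states from a text-only pass (task-solving ability)
    and an image+text pass (visual perception ability).

    Both tensors are assumed to have shape [batch, seq_len, hidden_dim],
    taken from the same decoder layer of the backbone LLM.
    """
    # Additive fusion: shift the multimodal representation toward the
    # text-only (instruction-following) representation by a factor alpha.
    return multimodal_hidden + alpha * (text_hidden - multimodal_hidden)

if __name__ == "__main__":
    # Toy tensors standing in for real decoder hidden states.
    h_text = torch.randn(1, 16, 4096)
    h_mm = torch.randn(1, 16, 4096)
    fused = fuse_representations(h_text, h_mm, alpha=0.5)
    print(fused.shape)  # torch.Size([1, 16, 4096])
```

The fused hidden states would then be passed to the remaining decoder layers (or the output head) in place of the original multimodal representations; the exact layers and coefficient would be design choices described in the full paper.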
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Vision question answering, Data-efficient training
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Data resources
Languages Studied: English
Keywords: Large Vision-Language Models, Representation Engineering
Submission Number: 2989