BrailleVision: Text Instruction Tuning of LLMs to Improve Visual Skills

Rohit Gupta; Praveen Tirupattur; Mamshad Nayeem Rizve; Mubarak Shah

22 Sept 2024 (modified: 05 Feb 2025). Submitted to ICLR 2025. License: CC BY 4.0

Keywords: LLMs, Vision-Language Models

TL;DR: Fine-tuning text LLMs to improve their base visual skills

Abstract: Large Language Models (LLMs) have shown exceptional proficiency in natural language processing tasks. More recently, their potential has been explored in vision-centric applications. Current multimodal large language models (MLLMs) incorporate general-purpose LLMs through multimodal instruction tuning. These LLMs, however, lack prior vision-centric, text-based training, potentially limiting their effectiveness. In this work, we propose a novel approach to enhance the vision-related capabilities of general-purpose LLMs through instruction fine-tuning with vision-centric text data. Specifically, we curate a diverse dataset, BrailleVision-360K, to teach skills such as visual perception, abstraction, and spatio-temporal reasoning without the use of visual data, analogous to how Braille codes are used by the visually impaired. The dataset is constructed in an automated manner by utilizing LLMs, bootstrapping from existing datasets, and employing VLMs to improve quality. Next, to fine-tune an LLM with this dataset, we introduce Fine-SFT, a novel fine-tuning approach that improves upon standard supervised fine-tuning and preference optimization techniques. Our vision-specialized LLM shows significant performance gains in tasks such as visual classification and open-vocabulary detection. Furthermore, when used as the "backbone" for an MLLM, our model outperforms existing LLMs on standard visual QA benchmarks while reducing hallucinations, highlighting the importance of vision-centric pretraining of LLMs in multimodal tasks.

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 2712
