Bridging the Writing Manner Gap in Visual Instruction Tuning by Creating LLM-aligned Instructions

ACL ARR 2024 June Submission58 Authors

05 Jun 2024 (modified: 02 Jul 2024) · CC BY 4.0
Abstract: In the realm of Large Multi-modal Models (LMMs), the quality of instructions during the visual instruction tuning stage significantly influences the performance of modality alignment. In this paper, we assess instruction quality from a unique perspective termed Writing Manner, which encompasses the selection of vocabulary, grammar, and sentence structure used to convey specific semantics. We argue that a substantial writing manner gap exists between visual instructions and the inner Large Language Models (LLMs) of LMMs. This gap causes well-trained inner LLMs to deviate from their original writing styles, degrading the capabilities of both the LMMs and their inner LLMs. To bridge the writing manner gap while preserving the original semantics, we propose directly leveraging the inner LLM to align the writing manner of soft-format visual instructions with that of the inner LLM itself, yielding novel LLM-aligned instructions. We develop a novel perplexity-based indicator to quantitatively assess the writing manner gap, and the results show that our approach successfully minimizes this gap. Trained with LLM-aligned instructions, the baseline models LLaVA-7B and QwenVL demonstrate enhanced resistance to hallucinations and non-trivial, comprehensive improvements across all $15$ visual and language benchmarks.
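The abstract describes the perplexity-based indicator only at a high level. As a minimal sketch of the underlying idea, the toy example below scores candidate instructions by their perplexity under a stand-in language model; here a smoothed unigram model replaces the inner LLM, and the corpus, function name, and example sentences are illustrative assumptions, not the paper's implementation.

```python
import math
from collections import Counter

def unigram_perplexity(text, counts, total):
    """Perplexity of `text` under a Laplace-smoothed unigram model,
    a toy stand-in for the inner LLM's token distribution."""
    words = text.split()
    vocab = len(counts) + 1  # +1 reserves mass for unseen words
    log_prob = sum(math.log((counts[w] + 1) / (total + vocab)) for w in words)
    return math.exp(-log_prob / len(words))

# Corpus representing the model's "own" writing style (assumed example data).
corpus = "the image shows a cat sitting on a mat the cat is small".split()
counts = Counter(corpus)
total = len(corpus)

# Writing-manner gap indicator: an instruction phrased off-style should
# receive higher perplexity than one matching the model's style.
aligned = "the image shows a small cat on a mat"
unaligned = "feline perched atop woven floor covering"
gap = unigram_perplexity(unaligned, counts, total) - unigram_perplexity(aligned, counts, total)
print(gap > 0)  # → True
```

A positive gap flags an instruction whose writing manner diverges from the model's own distribution; the paper's actual indicator uses the inner LLM itself rather than a unigram surrogate.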
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: multimodality, vision question answering
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 58