Bridging the Writing Manner Gap in Visual Instruction Tuning by Creating LLM-aligned Instructions

Anonymous

16 Dec 2023, ACL ARR 2023 December Blind Submission, Readers: Everyone
Abstract: In Large Multi-modal Models (LMMs), modality alignment is ultimately constrained by the quality of the instructions used in the Supervised Fine-Tuning (SFT) phase. In this paper, we assess instruction quality from a unique perspective called Writing Manner, which refers to the habits of word choice, grammar, and sentence structure used to express given semantics. We argue that a severe writing manner gap exists between the visual instructions and the Large Language Models (LLMs) within LMMs. During the SFT phase, the more pronounced the writing manner gap, the more the inner LLM is updated, which degrades the capabilities of both the inner LLM and the LMM. To bridge the writing manner gap, under the premise of not changing the original semantics, we propose to directly exploit the inner LLM to align the writing manner of soft-format visual instructions with that of the inner LLM itself, which yields novel LLM-aligned instructions. With LLM-aligned instructions, the two baselines LLaVA-7B and LLaVA-13B improve on all 12 benchmarks and on 10 of 12 benchmarks, respectively. Furthermore, evaluation results on the inner LLM demonstrate that the proposed strategy can effectively maintain the consistency and capabilities of the inner LLM.
Paper Type: long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English
