Abstract: In Large Multi-modal Models (LMMs), the quality of modality alignment is ultimately constrained by the quality of the instructions used in the Supervised Fine-Tuning (SFT) phase. In this paper, we assess instruction quality from a novel perspective, Writing Manner, which refers to the habitual choices of words, grammar, and sentence structure used to express a given semantics. We argue that a severe writing manner gap exists between visual instructions and the Large Language Models (LLMs) inside LMMs. During the SFT phase, the more pronounced this gap, the more the inner LLM is updated, degrading the capabilities of both the inner LLM and the LMM. To bridge the writing manner gap without changing the original semantics, we propose directly exploiting the inner LLM to align the writing manner of soft-format visual instructions with that of the inner LLM itself, yielding novel LLM-aligned instructions. With LLM-aligned instructions, the two baselines LLaVA-7B and LLaVA-13B improve on all 12 benchmarks and on 10 of 12 benchmarks, respectively. Furthermore, evaluation of the inner LLM demonstrates that the proposed strategy effectively preserves the inner LLM's consistency and capabilities.
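A minimal sketch of the alignment step described in the abstract, assuming the inner LLM is Vicuna (the LLM used in LLaVA) and an illustrative rewrite prompt; the model ID, prompt wording, and the function name align_writing_manner are assumptions for illustration, not the authors' released implementation:

# Sketch: rewrite a visual instruction's response with the inner LLM so the
# writing manner matches the LLM itself while the semantics stay unchanged.
# Model ID and prompt wording below are illustrative assumptions.
from transformers import pipeline

generator = pipeline("text-generation", model="lmsys/vicuna-7b-v1.5")

REWRITE_PROMPT = (
    "Rewrite the following answer in your own natural wording. "
    "Do not add, remove, or change any factual content.\n\n"
    "Answer: {answer}\n\nRewritten answer:"
)

def align_writing_manner(answer: str) -> str:
    """Paraphrase `answer` with the inner LLM, keeping semantics fixed."""
    prompt = REWRITE_PROMPT.format(answer=answer)
    out = generator(prompt, max_new_tokens=256, do_sample=False)
    # The text-generation pipeline returns prompt + continuation;
    # keep only the newly generated rewrite.
    return out[0]["generated_text"][len(prompt):].strip()

# Example: align one soft-format visual instruction response before SFT.
aligned = align_writing_manner(
    "The image depicts a dog that is laying on a couch near a window."
)
print(aligned)

Greedy decoding (do_sample=False) is chosen here so the paraphrase is deterministic and stays close to the model's own preferred phrasing; sampling would inject variation unrelated to the writing manner objective.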
Paper Type: long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English