Progressive Multimodal Chain-of-Thought Tuning for Vision-Indispensable Reasoning

ACL ARR 2024 June Submission2577 Authors

15 Jun 2024 (modified: 02 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: Recent advancements in multimodal large language models (MLLMs) have showcased their impressive capabilities in multimodal understanding and generation. Nevertheless, current open-source MLLMs still struggle with complex reasoning and problem-solving, especially in vision-indispensable scenarios. In this paper, we present ViLamr, an MLLM tailored for vision-indispensable reasoning. To endow ViLamr with strong reasoning capabilities, we first construct a multimodal instruction-following dataset, MCoT-Instruct, featuring 266K high-quality chain-of-thought responses. We then equip ViLamr with a novel connector that selectively integrates different visual features and facilitates alignment between correlated vision and language content. Finally, we fine-tune ViLamr on MCoT-Instruct with a carefully designed reasoning progressive-enhancement tuning scheme, encouraging ViLamr to follow the cognitive process of "understanding before reasoning". Experiments on multiple multimodal benchmarks and datasets demonstrate the effectiveness of ViLamr and the contribution of MCoT-Instruct to bolstering MLLM reasoning capabilities.
Paper Type: Long
Research Area: Question Answering
Research Area Keywords: multimodal QA, reasoning
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 2577