Keywords: MLLM; Multimodal; Visual Reasoning
TL;DR: Empowering Multimodal Large Language Models with Evol-Instruct
Abstract: The development of Multimodal Large Language Models (MLLMs) has seen significant advancements amid increasing demand across various fields (e.g., multimodal agents, embodied intelligence). While model-driven approaches attempt to enhance MLLM capabilities through diverse architectures, the gains have become increasingly marginal. Conversely, data-driven methods, which scale up image-text instruction data, are more effective but face the challenges of limited data diversity and complexity. The absence of high-quality data constitutes a significant development barrier for MLLMs. To address this data quality bottleneck, we propose MMEvol, a novel multimodal instruction data evolution framework. The framework iteratively improves data quality through a refined combination of fine-grained perception, cognitive reasoning, and interaction evolution, generating a more complex and diverse image-text instruction dataset that empowers MLLMs with enhanced capabilities. Beginning with an initial set of instructions, SEED-163K, we utilize MMEvol to systematically broaden the diversity of instruction types, extend visual reasoning steps to improve cognitive reasoning abilities, and thoroughly explore fine-grained information within images to enhance visual understanding and robustness. To comprehensively evaluate the effectiveness of our approach, we conduct extensive qualitative analysis and quantitative experiments across 13 vision-language tasks. Compared to baseline models trained with the initial seed data, our method achieves an average accuracy improvement of 3.1 percentage points. Furthermore, our approach reaches state-of-the-art (SOTA) performance on nine tasks while using significantly less data than previous SOTA models.
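To make the evolution loop described in the abstract concrete, here is a minimal Python sketch of an MMEvol-style iteration. The three evolution operations follow the abstract (fine-grained perception, cognitive reasoning, and interaction evolution); the `llm` callable, the sample schema, and the `quality_filter` hook are hypothetical placeholders for illustration, not the authors' implementation.

```python
import random

# Illustrative sketch of an MMEvol-style instruction-evolution loop.
# Operation names follow the abstract; everything else is assumed.
EVOLUTIONS = [
    "fine_grained_perception",  # probe finer visual details in the image
    "cognitive_reasoning",      # extend the visual reasoning chain
    "interaction",              # diversify instruction types and formats
]

def evolve_instruction(llm, sample, operation):
    """Ask an LLM to rewrite one image-text instruction with a chosen operation.

    `llm` is assumed to be a callable that takes a prompt string and returns
    an evolved sample dict with the same keys as the input sample.
    """
    prompt = (
        f"Rewrite the following instruction using the '{operation}' evolution.\n"
        f"Image caption: {sample['caption']}\n"
        f"Instruction: {sample['instruction']}\n"
        f"Answer: {sample['answer']}"
    )
    return llm(prompt)

def mmevol(llm, seed_data, rounds=3, quality_filter=lambda s: True):
    """Iteratively evolve a seed instruction set (e.g., SEED-163K)."""
    data = list(seed_data)
    for _ in range(rounds):
        evolved = []
        for sample in data:
            op = random.choice(EVOLUTIONS)
            candidate = evolve_instruction(llm, sample, op)
            # Keep a candidate only if it passes an instruction-quality check;
            # otherwise fall back to the unevolved sample.
            evolved.append(candidate if quality_filter(candidate) else sample)
        data = evolved
    return data
```

The key design point the abstract implies is that evolution is iterative: each round rewrites the previous round's output rather than the original seed, so instruction complexity and diversity compound across rounds while a quality filter guards against degenerate rewrites.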
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 643