Keywords: Large Multimodal Models, Large Language Models, Reinforcement Learning, Multimodal Reasoning, Self-Evolve
Abstract: Reasoning ability is essential for Large Multimodal Models (LMMs).
In the absence of annotated multimodal chain-of-thought data, self-evolving training, where the model learns from its own outputs, has emerged as an effective and scalable approach for enhancing reasoning abilities.
Despite its growing usage, a comprehensive understanding of self-evolving training, particularly in the context of multimodal reasoning, remains limited. In this paper, we delve into the intricacies of self-evolving training for multimodal reasoning, pinpointing three key factors: $\textbf{Training Method}$, $\textbf{Reward Model}$, and $\textbf{Prompt Variation}$. We systematically examine each factor and explore how various configurations affect the training's effectiveness. Our analysis leads to a set of best practices for each factor, aimed at optimizing multimodal reasoning.
Furthermore, we explore the $\textbf{Self-Evolution Dynamics}$ during training and the impact of automatic balancing mechanisms in boosting performance. After all the investigations, we present a final recipe for self-evolving training in multimodal reasoning, encapsulating these design choices into a framework we call M-STAR ($\textbf{M}$ultimodal $\textbf{S}$elf-evolving $\textbf{T}$r$\textbf{a}$ining for $\textbf{R}$easoning), built on MiniCPM-V 2.5.
M-STAR achieves 59.5% accuracy on MathVista, an absolute improvement of 6.9% over the pre-evolved model, without using any additional human annotations.
We believe this study fills a significant gap in the understanding of self-evolving training for multimodal reasoning and offers a robust framework for future research. Our policy and reward models, as well as the collected data, will be released to facilitate further investigation in multimodal reasoning.
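To make the core idea concrete, below is a minimal, illustrative sketch of a generic self-evolving (rejection-sampling and retraining) loop of the kind the abstract describes; it is not the authors' exact M-STAR recipe. The helpers `generate`, `reward`, and `finetune` are hypothetical stand-ins for the policy LMM's sampler, the reward model, and a fine-tuning step.

```python
# Illustrative sketch only (assumed interfaces, not the authors' implementation):
# one round of self-evolving training, where the policy learns from its own
# outputs after a reward model filters them.
from typing import Callable, List, Tuple


def self_evolve_round(
    prompts: List[str],
    generate: Callable[[str, int], List[str]],           # policy: prompt -> k sampled responses
    reward: Callable[[str, str], float],                  # reward model: (prompt, response) -> score
    finetune: Callable[[List[Tuple[str, str]]], None],    # update policy on selected (prompt, response) pairs
    k: int = 8,
    threshold: float = 0.5,
) -> int:
    """Sample k responses per prompt, keep those the reward model accepts, retrain."""
    selected: List[Tuple[str, str]] = []
    for prompt in prompts:
        candidates = generate(prompt, k)                  # model's own chain-of-thought attempts
        # keep only responses the reward model scores above the threshold
        selected.extend(
            (prompt, r) for r in candidates if reward(prompt, r) >= threshold
        )
    if selected:
        finetune(selected)                                # the model learns from its own filtered outputs
    return len(selected)
```

In practice, the design choices studied in the paper (training method, reward model, and prompt variation, plus the balancing of self-evolution dynamics) correspond to how `finetune`, `reward`, and the set of `prompts` are instantiated and adapted across rounds.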
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3772