Keywords: Large Multimodal Models, Large Language Models, Reinforcement Learning, Multimodal Reasoning, Self-Evolve
Abstract: Reasoning ability is essential for Large Multimodal Models (LMMs).
In the absence of annotated multimodal chain-of-thought data, self-evolving training, where the model learns from its own outputs, has emerged as an effective and scalable approach for enhancing reasoning abilities.
Despite its growing usage, a comprehensive understanding of self-evolving training, particularly in the context of multimodal reasoning, remains limited. In this paper, we delve into the intricacies of self-evolving training for multimodal reasoning, pinpointing three key factors: $\textbf{Training Method}$, $\textbf{Reward Model}$, and $\textbf{Prompt Variation}$. We systematically examine each factor and explore how various configurations affect the training's effectiveness. Our analysis leads to a set of best practices for each factor, aimed at optimizing multimodal reasoning.
Furthermore, we explore the $\textbf{Self-Evolution Dynamics}$ during training and the impact of automatic balancing mechanisms in boosting performance. After all the investigations, we present a final recipe for self-evolving training in multimodal reasoning, encapsulating these design choices into a framework we call M-STAR ($\textbf{M}$ultimodal $\textbf{S}$elf-evolving $\textbf{T}$r$\textbf{a}$ining for $\textbf{R}$easoning), built on MiniCPM-V 2.5.
M-STAR achieves 59.5% accuracy on MathVista, an absolute improvement of 6.9% over the pre-evolved model, without using any additional human annotations.
We believe this study fills a significant gap in the understanding of self-evolving training for multimodal reasoning and offers a robust framework for future research. Our policy and reward models, as well as the collected data, will be released to facilitate further investigation in multimodal reasoning.
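To make the core idea concrete, below is a minimal, illustrative sketch of a generic self-evolving (rejection-sampling and retraining) loop of the kind the abstract describes; it is not the authors' exact M-STAR recipe. The helpers `generate`, `reward`, and `finetune` are hypothetical stand-ins for the policy LMM's sampler, the reward model, and a fine-tuning step.

```python
# Illustrative sketch only (assumed interfaces, not the authors' implementation):
# one round of self-evolving training, where the policy learns from its own
# outputs after a reward model filters them.
from typing import Callable, List, Tuple


def self_evolve_round(
    prompts: List[str],
    generate: Callable[[str, int], List[str]],           # policy: prompt -> k sampled responses
    reward: Callable[[str, str], float],                  # reward model: (prompt, response) -> score
    finetune: Callable[[List[Tuple[str, str]]], None],    # update policy on selected (prompt, response) pairs
    k: int = 8,
    threshold: float = 0.5,
) -> int:
    """Sample k responses per prompt, keep those the reward model accepts, retrain."""
    selected: List[Tuple[str, str]] = []
    for prompt in prompts:
        candidates = generate(prompt, k)                  # model's own chain-of-thought attempts
        # keep only responses the reward model scores above the threshold
        selected.extend(
            (prompt, r) for r in candidates if reward(prompt, r) >= threshold
        )
    if selected:
        finetune(selected)                                # the model learns from its own filtered outputs
    return len(selected)
```

In practice, the design choices studied in the paper (training method, reward model, and prompt variation, plus the balancing of self-evolution dynamics) correspond to how `finetune`, `reward`, and the set of `prompts` are instantiated and adapted across rounds.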
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3772