Keywords: MLLM, Data Generation, Geometry
Abstract: Multimodal large language models (MLLMs) have many practical applications that demand strong reasoning abilities. Despite recent advances, these models still struggle to solve complex geometric problems. A key challenge is the lack of high-quality image-text pair datasets for understanding geometric images. Moreover, most symbolic data synthesis pipelines fail to generalize to questions beyond their predefined templates. In this paper, we bridge this gap by introducing a complementary process of Reinforcement Learning with Verifiable Rewards (RLVR) into the data generation pipeline. By adopting accuracy-guided RLVR to refine captions for symbolically synthesized geometric images, our pipeline captures the key features of geometry problem-solving, which enables better task generalization. Even in out-of-distribution scenarios, the generated dataset, GeoReasoning-10K, achieves non-trivial accuracy gains of 2.8\%–4.8\% on the non-geometric subtasks of MathVista and MathVerse. This generalization ability is further validated on MMMU, where improvements of 2.4\%–3.9\% are observed on the Art \& Design and Tech \& Engineering tasks.
Primary Area: datasets and benchmarks
Submission Number: 13527