Generalizable Geometric Image Caption Synthesis

Yue Xin; Wenyuan Wang; Rui Pan; Ruida WANG; BingXu Meng; Renjie Pi; Shizhe Diao; Tong Zhang

18 Sept 2025 (modified: 11 Feb 2026). Submitted to ICLR 2026. License: CC BY 4.0

Keywords: MLLM, Data Generation, Geometry

Abstract: Multimodal large language models (MLLMs) have many practical applications that demand strong reasoning abilities. Despite recent advances, these models still struggle to solve complex geometric problems. A key challenge is the lack of high-quality image-text pair datasets for understanding geometric images. Moreover, most symbolic data synthesis pipelines fail to generalize to questions beyond their predefined templates. In this paper, we bridge this gap by introducing a complementary process of Reinforcement Learning with Verifiable Rewards (RLVR) into the data generation pipeline. By adopting accuracy-guided RLVR to refine captions for symbolically synthesized geometric images, our pipeline captures the key features of geometry problem-solving, which enables better task generalization. Even in out-of-distribution scenarios, the generated dataset, GeoReasoning-10K, yields accuracy improvements of 2.8%–4.8% on the non-geometric subtasks of MathVista and MathVerse. This generalization ability is further validated on MMMU, where improvements of 2.4%–3.9% are observed on the Art & Design and Tech & Engineering tasks.
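
The abstract does not spell out how the accuracy-guided reward is computed. As a rough illustration only, the sketch below shows one way a verifiable, accuracy-based reward for a caption could be defined: a solver answers questions using only the caption, and the reward is the fraction answered correctly. The names `accuracy_reward`, `toy_solver`, and `normalize`, the exact-match scoring, and the `name = value` caption format are all assumptions made for this example, not details from the paper.

```python
from typing import Callable, Sequence


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(answer.strip().lower().split())


def accuracy_reward(
    caption: str,
    questions: Sequence[str],
    gold_answers: Sequence[str],
    solver: Callable[[str, str], str],
) -> float:
    """Hypothetical reward: fraction of questions the solver answers correctly
    when it sees only the caption (assumed exact-match scoring)."""
    if len(questions) != len(gold_answers):
        raise ValueError("questions and gold_answers must have the same length")
    if not questions:
        return 0.0
    correct = sum(
        normalize(solver(caption, q)) == normalize(a)
        for q, a in zip(questions, gold_answers)
    )
    return correct / len(questions)


def toy_solver(caption: str, question: str) -> str:
    """Stand-in for a text-only model (assumption): reads 'name = value' facts
    separated by ';' and returns the value whose name appears in the question."""
    for fact in caption.split(";"):
        name, _, value = fact.partition("=")
        if name.strip() and name.strip() in question:
            return value.strip()
    return "unknown"


questions = ["What is the length of AB?", "What is angle C?"]
gold = ["5", "90"]

full_reward = accuracy_reward("AB = 5; angle C = 90", questions, gold, toy_solver)
sparse_reward = accuracy_reward("AB = 5", questions, gold, toy_solver)
print(full_reward, sparse_reward)
```

Under these assumptions, a caption that states every fact needed downstream scores higher than one that omits a fact, which is the kind of signal an RL procedure could use to prefer more informative captions.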

Primary Area: datasets and benchmarks

Submission Number: 13527
