Enhancing Vision-Language Reasoning via Reinforcement Learning with Scalable Multimodal QA Synthesis

17 Sept 2025 (modified: 13 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: multimodal large language models, visual-language reasoning
Abstract: Building on the success of text-based reasoning models such as DeepSeek-R1, extending these capabilities to multimodal reasoning holds great promise. While recent works have attempted to adapt DeepSeek-R1-style reinforcement learning (RL) training paradigms to multimodal large language models (MLLMs), focusing on domain-specific tasks such as math and visual perception, a critical question remains: how can we enhance vision-language reasoning through RL across different domains? To address this challenge, we make three key contributions: (1) a novel Scalable Multimodal QA Synthesis pipeline that autonomously generates domain-aware, reasoning-centric question-answer (QA) pairs directly from images across different domains; (2) the open-source WeThink dataset, containing over 120K multimodal QA pairs with annotated reasoning paths, curated from 18 diverse dataset sources and covering a wide range of question domains; and (3) a simple baseline incorporating a hybrid reward mechanism that combines rule-based verification with model-based assessment to improve RL training efficiency across task domains. Through a comprehensive study of RL on our dataset, we demonstrate that the WeThink dataset significantly improves performance across diverse MLLM benchmarks. Furthermore, we show that our automated data pipeline can continuously increase data diversity, further boosting model performance. Anonymized code and the dataset are available at \url{https://anonymous.4open.science/r/WeThink-7C9A} and \url{https://huggingface.co/datasets/WeThink/WeThink-Multimodal-Reasoning-120K}.
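For illustration only (this is not code from the paper): a minimal sketch of how a hybrid reward of this kind might be composed, assuming a hypothetical \boxed{...} answer format, hypothetical function names (rule_based_reward, model_based_reward, hybrid_reward), and a stubbed-out judge model. The authors' actual reward design may differ.

```python
import re
from typing import Optional

def rule_based_reward(response: str, reference: str) -> Optional[float]:
    """Verify a closed-form answer, assumed here to be wrapped in \\boxed{...}.

    Returns 1.0/0.0 when a verifiable answer is present, or None when no
    answer can be extracted and rule-based checking does not apply.
    """
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match is None:
        return None
    return 1.0 if match.group(1).strip().lower() == reference.strip().lower() else 0.0

def model_based_reward(question: str, response: str, reference: str) -> float:
    """Stub for a judge-model score in [0, 1] on open-ended answers.

    A real implementation would prompt an LLM judge with the question,
    the candidate response, and the reference answer.
    """
    raise NotImplementedError("plug in a judge model here")

def hybrid_reward(question: str, response: str, reference: str) -> float:
    """Use cheap rule-based verification when possible; otherwise fall back
    to model-based assessment."""
    score = rule_based_reward(response, reference)
    if score is not None:
        return score
    return model_based_reward(question, response, reference)
```

Routing each rollout through the cheap rule-based check first and invoking a judge model only for answers that cannot be verified by rules is one plausible way such a hybrid could trade accuracy of the reward signal against RL training cost across task domains.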
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 8511