VCoT-Pro: Advancing Visual Reasoning with a Large-scale Comprehensive Visual Chain-of-Thought Dataset

Lingxiao Li; Yifan Wang; Xinyan Gao; Chen Tang; Xiangyu Yue; Chenyu You

08 Sept 2025 (modified: 25 Sept 2025), ICLR 2026 Conference Withdrawn Submission, CC BY 4.0

Keywords: Visual-CoT, VLM, LLM, dataset

Abstract: Chain-of-Thought (CoT) prompting has emerged as a powerful technique for eliciting complex reasoning in Large Language Models (LLMs). However, its potential within multimodal large language models (MLLMs) remains largely unrealized. A primary bottleneck is the lack of suitable datasets and benchmarks: existing visual-CoT resources are often limited in scale and diversity, or fail to capture the human-like, spatially-aware reasoning required for genuine visual understanding. To address these limitations, we introduce VisCoT-Pro, a large-scale and comprehensive benchmark designed to advance visual CoT reasoning. Our benchmark comprises two key components: (1) the main VisCoT-Pro dataset with 506k examples covering four domains, featuring multi-round, human-like step-by-step supervision that is substantially larger and more detailed than prior resources, and (2) VisCoT-Pro-Max, a 165k subset with richer step rationales and 3D grounding via depth-informed annotations, produced with stronger GPT-4.1-series guidance. We conduct extensive experiments on the state-of-the-art Qwen2.5-VL model. Training on VisCoT-Pro not only yields substantial improvements in the model's intrinsic step-by-step visual reasoning capabilities but also demonstrates remarkable generalization, significantly boosting performance on existing academic benchmarks. This highlights our dataset's ability to equip VLMs with robust, transferable reasoning skills, enabling them to better understand and think about the visual world. We release VisCoT-Pro as a foundational resource, providing the community with both a high-quality training corpus and a reliable benchmark to catalyze future research in visual CoT.

Primary Area: datasets and benchmarks

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2026/AuthorGuide.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 3201
