Keywords: visual reasoning, benchmark, thinking with images, MLLM
Abstract: Models like OpenAI-o3 pioneer visual grounded reasoning by dynamically
referencing visual regions, much as humans “think with images”. However, no
benchmark exists to evaluate these capabilities holistically. To bridge this gap, we
propose TreeBench (Traceable Evidence Evaluation Benchmark), a diagnostic
benchmark built on three principles: (1) focused visual perception of subtle targets
in complex scenes, (2) traceable evidence via bounding box evaluation, and (3)
second-order reasoning to test object interactions and spatial hierarchies beyond
simple object localization. Prioritizing images with dense objects, we first
sample 1K high-quality images from SA-1B and enlist eight LMM experts
to manually annotate questions, candidate options, and answers for each image.
After three stages of quality control, TreeBench consists of 405 challenging
visual question-answering pairs. Even the most advanced models struggle with this
benchmark: none of them reaches 60% accuracy, e.g., OpenAI-o3 scores only
54.87. Furthermore, we introduce TreeVGR (Traceable Evidence Enhanced Visual
Grounded Reasoning), a training paradigm to supervise localization and reasoning
jointly with reinforcement learning, enabling accurate localizations and explainable
reasoning pathways. Initialized from Qwen2.5-VL-7B, it improves V* Bench
(+16.8), MME-RealWorld (+12.6), and TreeBench (+13.4), proving traceability is
key to advancing vision-grounded reasoning. The code and data will be released.
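
To make principle (2), traceable evidence via bounding box evaluation, more concrete, the following is a minimal sketch of how predicted evidence boxes could be scored against annotated ones. The abstract does not state which metric is used; intersection over union (IoU) is assumed here as one common choice, and the function names, the corner-format box layout, and the one-to-one pairing of predictions with targets are all illustrative assumptions rather than the benchmark's actual protocol.

```python
from typing import Sequence, Tuple

# Assumed box layout: (x1, y1, x2, y2) corner coordinates in pixels.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    # Degenerate boxes with zero union score 0 rather than dividing by zero.
    return inter / union if union > 0 else 0.0


def mean_iou(preds: Sequence[Box], targets: Sequence[Box]) -> float:
    """Average IoU over paired predictions and targets (hypothetical pairing)."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    return sum(box_iou(p, t) for p, t in zip(preds, targets)) / len(preds)
```

Under these assumptions, a model that answers correctly but points at the wrong region would receive a low localization score, which is the kind of mismatch a traceable-evidence evaluation is meant to expose.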
Primary Area: datasets and benchmarks
Submission Number: 1547