Keywords: LLM/VLM, Benchmark, Agentic System, Multi-view Reasoning
TL;DR: We benchmarked VLMs' multi-view reasoning ability, discovered common failure patterns and biases, and proposed a multi-agent system that improves performance.
Abstract: Recent advances in Vision–Language Models (VLMs) have opened new possibilities for complex spatial reasoning. Benchmarks for VLMs largely assess single- or limited-view perception, leaving untested the core ability to integrate observations across viewpoints into a coherent 3D understanding. We introduce MVBench, a benchmark expressly designed to evaluate multi-view integration for holistic 3D scene comprehension. MVBench is paired with a highly extensible data-generation pipeline that supports plug-and-play 3D assets (synthetic or real), configurable distractors, and flexible camera positions and orientations, enabling researchers to readily instantiate new datasets by swapping assets or altering viewpoint configurations. Beyond benchmarking, MVBench serves as a fundamental diagnostic that VLMs should pass before being deployed as agents operating 3D software for downstream tasks such as 3D asset generation and part assembly for mechanical engineering. We evaluate a broad set of frontier VLMs and uncover consistent failure modes: strong performance on 2D planar relations from a single image, but marked difficulty with 3D spatial relations and with aggregating information across views. We further identify biases in VLMs, including difficulty handling unconventional axis directions and sensitivity to object colorways and texture variations. Acknowledging these limitations, we propose ViewNavigator, a multi-agent framework that actively selects informative viewpoints, perceives the scene, and fuses multi-view evidence through belief updating. ViewNavigator improves the performance of diverse base models on MVBench by more than 50%. MVBench and its extensible pipeline are designed to equip researchers with a principled testbed for strengthening VLMs’ 3D scene understanding, paving the way for more capable VLM-based agents that can support a wide range of downstream 3D tasks.
Primary Area: datasets and benchmarks
Submission Number: 19241